Neural Networks & Deep Learning
2020-Spring, 2020-Fall, 2021-Fall, 2022-Fall
This is Dr. Gan Tian. She is currently a Professor with the School of Computer Science and Technology, Shandong University.
Dr. Gan received her B.Sc. from East China Normal University in Jul. 2010. She then joined the School of Computing, National University of Singapore, where she pursued her Ph.D. degree under the supervision of Prof. Mohan Kankanhalli from Aug. 2010 to Aug. 2015. Before returning to China, she worked at REVIVE, Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR), from Aug. 2014 to Sep. 2016.
Gan Tian is a Professor and doctoral supervisor with the School of Computer Science and Technology, Shandong University, and a "Young Changjiang Scholar" of the Ministry of Education. She received her B.Sc. from East China Normal University in 2010 and her Ph.D. from the National University of Singapore in 2015, and then worked as a Research Scientist at the Institute for Infocomm Research (I2R), A*STAR, Singapore (2014–2016). She serves as Deputy Secretary-General of the CSIG Multimedia Technical Committee, Senior Program Committee member of IJCAI 2023, Area Chair of ACM MM 2023, and Publication Chair of ICIMCS 2019 and ACM MM Asia 2020. She has led national and provincial/ministerial projects, including an NSFC Young Scientists Fund project, an NSFC General Program project, and a subproject of the Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project. She has published in top international journals and conferences in related fields, such as CVPR, ACM MM, ACL, AAAI, IEEE TIP, and IEEE TMM, and received the First Prize of the Shandong Province Science and Technology Progress Award (2021) and the First Prize of the Shandong Province Technological Invention Award (2023).
She recruits M.Sc. students, Ph.D. students, and postdocs in multimedia information processing, computer vision, and machine learning.
Ph.D./M.Sc. admissions:
- Students with a solid mathematical foundation, strong programming skills, and a commitment to in-depth research in multimedia information processing, computer vision, and machine learning are welcome to apply!
- For the 2026 intake: M.Sc. students (3+ positions, all filled) and Ph.D. students (2+ positions, one candidate already confirmed; one position still open, feel free to get in touch) in multimedia information processing, computer vision, and machine learning. (Updated: 2025-12-11)
- Students from Shandong University's Qiangji Program (强基计划) are welcome to get in touch and join the group!
- If you would like to join the group, please send your CV to gantian@sdu.edu.cn and cc the research assistant at zhangwenchao@sdu.edu.cn, with the email subject formatted as: application type + name + university.
Research directions:
- Social media computing: multimodal social media controversy detection, multi-agent-based democratic controversy mediation, LLM-based community propagation prediction, etc.;
- Cross-modal computing: automatic hashtag generation for short videos, natural-language-based intelligent video retrieval, intelligent audio generation, etc.;
- Computer vision: unbiased scene graph generation, visual perception under adverse conditions, etc.;
- Multimodal large models: multimodal video pre-training, efficient deployment and optimization of large models, etc.
Multi-modal Large Language Models (MLLMs) have advanced significantly, offering powerful vision-language understanding capabilities. However, these models often inherit severe social biases from their training datasets, leading to unfair predictions based on attributes like race and gender. This paper addresses the issue of social biases in MLLMs by (i) introducing a comprehensive Counterfactual dataset with Multiple Social Concepts (CMSC), which provides a more diverse and extensive training set compared to existing datasets, and (ii) proposing an Anti-Stereotype Debiasing strategy (ASD). Our method works by revisiting the MLLM training process, rescaling the autoregressive loss function, and improving data sampling methods to counteract biases. Through extensive experiments on various MLLMs, our CMSC dataset and ASD method demonstrate a significant reduction in social biases while maintaining the models' original performance.
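The ASD strategy above rescales the autoregressive loss and adjusts data sampling; the exact formulation is given in the paper. The sketch below only illustrates, in PyTorch, how a per-sample weight could rescale a token-level autoregressive loss; the weighting rule and all names here are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def reweighted_autoregressive_loss(logits, targets, sample_weights, ignore_index=-100):
    """Token-level cross-entropy rescaled by a per-sample weight.

    logits:         (batch, seq_len, vocab) next-token predictions
    targets:        (batch, seq_len) gold token ids (ignore_index marks padding)
    sample_weights: (batch,) scaling factors, e.g. larger for samples whose
                    answers follow a stereotype (hypothetical weighting rule)
    """
    batch, seq_len, vocab = logits.shape
    token_loss = F.cross_entropy(
        logits.reshape(-1, vocab),
        targets.reshape(-1),
        ignore_index=ignore_index,
        reduction="none",
    ).reshape(batch, seq_len)
    mask = (targets != ignore_index).float()
    per_sample = (token_loss * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
    return (sample_weights * per_sample).mean()

# Toy usage with random tensors.
logits = torch.randn(2, 6, 50, requires_grad=True)
targets = torch.randint(0, 50, (2, 6))
weights = torch.tensor([1.0, 1.5])   # second sample counted more heavily
loss = reweighted_autoregressive_loss(logits, targets, weights)
loss.backward()
print(loss.item())
```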
Multimodal controversy detection, which involves determining whether a given video and its associated comments are controversial, plays a pivotal role in risk management on social video platforms. Existing methods typically provide only classification results, failing to identify what aspects are controversial and why, thereby lacking detailed explanations. To address this limitation, we propose a novel Agent-based Multimodal Controversy Detection architecture, termed AgentMCD. This architecture leverages Large Language Models (LLMs) as generative agents to simulate human behavior and improve explainability. AgentMCD employs a multi-aspect reasoning process, where multiple judges conduct evaluations from diverse perspectives to derive a final decision. Furthermore, a multi-agent simulation process is incorporated, wherein agents act as audiences, offering opinions and engaging in free discussions after watching videos. This hybrid framework enables comprehensive controversy evaluation and significantly enhances explainability. Experiments conducted on the MMCD dataset demonstrate that our proposed architecture outperforms existing LLM-based baselines in both high-resource and low-resource comment scenarios, while maintaining superior explainability.
With the increasing prevalence of online communities, social networks have become pivotal platforms for information propagation. However, this rise is accompanied by issues such as the spread of misinformation and online rumors. Community-Level Information Pathway Prediction (CLIPP) has been proposed to help curb the propagation of harmful information within specific communities. While progress has been made in understanding user-level propagation, a significant gap remains in addressing the CLIPP problem at the community level, particularly with regard to social-context interpretation and the cold-start problem in niche communities. To bridge this gap, we propose a novel model, named Community-Level Propagation Prediction with LLM enhanced Social Context Interpretation and Community Coldstart (ComPaSC3), which integrates three primary modules. The video enhancement module leverages LLMs to enrich the interpretation of multimedia content by embedding world knowledge. The community portrait building module utilizes LLMs to generate detailed community portraits for community interpretation. To tackle the community cold-start problem, the dynamic commLink module links non-popular communities to popular ones based on their portrait similarity and dynamically updates their relationship weights. Our experimental results demonstrate that ComPaSC3 significantly improves predictive accuracy in both popular and non-popular scenarios. Particularly in non-popular communities, our approach outperforms existing state-of-the-art methods, achieving improvements of 10.00%-15.20% in Rec@5 and 7.31%-12.32% in NDCG@10.
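The dynamic commLink module above links non-popular communities to popular ones by portrait similarity and updates the weights over time. A minimal NumPy sketch of that idea follows; the cosine linking and moving-average update are illustrative assumptions, not ComPaSC3's actual formulas.

```python
import numpy as np

def cosine_sim(vec, mat):
    """Cosine similarity between one portrait vector and a matrix of portraits."""
    vec = vec / (np.linalg.norm(vec) + 1e-8)
    mat = mat / (np.linalg.norm(mat, axis=1, keepdims=True) + 1e-8)
    return mat @ vec

def link_cold_community(cold_portrait, popular_portraits, top_k=3):
    """Link a cold-start community to its top-k most similar popular communities."""
    sims = cosine_sim(cold_portrait, popular_portraits)
    idx = np.argsort(-sims)[:top_k]
    return idx, sims[idx]

def update_link_weights(old_weights, new_sims, momentum=0.9):
    """Exponential moving average so link weights adapt as portraits evolve."""
    return momentum * old_weights + (1.0 - momentum) * new_sims

# Toy example: 5 popular communities with 16-d portrait embeddings.
rng = np.random.default_rng(0)
popular = rng.normal(size=(5, 16))
cold = rng.normal(size=16)
idx, sims = link_cold_community(cold, popular)
weights = update_link_weights(np.zeros_like(sims), sims)
print(idx, np.round(weights, 3))
```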
Democratic mediation serves as a vital mechanism for resolving social conflicts; however, current practices encounter three critical limitations: (1) inefficient operations, wherein traditional labor-intensive mediation processes are both time-consuming and inefficient; (2) theoretical gaps, as prevailing mediation theories fail to explore the underlying causes of conflicts; and (3) inadequate analysis, with existing digital tools lacking comprehensive conflict mediation capabilities and primarily focusing on singular data types. To address these limitations, we introduce the Normative Social Simulator for Democratic Mediation, referred to as Norm Mediat. This framework is specifically designed to simulate democratic mediation, incorporating social norms. Central to this framework is the integration of normative reasoning into the mediation process, which enhances the ability to understand individuals’ intrinsic needs and identify the root causes of conflicts. The framework comprises two essential components: (1) Dynamic Multimodal Conflict Modeling (DMCM), which generates the initial dataset of conflict interactions; and (2) Norm-Aware Iterative Mediation (NAIM), which implements an iterative democratic mediation process through norm awareness. The results of our human evaluation underscore the effectiveness of our norm-driven mediation strategies. This research significantly contributes to computational social science by providing a comprehensive methodological framework for simulating democratic processes and offering a benchmark dataset for conflict resolution studies.
The user base of short video apps has experienced unprecedented growth in recent years, resulting in a significant demand for video content analysis. In particular, text-video retrieval, which aims to find the top matching videos given text descriptions from a vast video corpus, is an essential function, the primary challenge of which is to bridge the modality gap. Nevertheless, most existing approaches treat texts merely as discrete tokens and neglect their syntax structures. Moreover, the abundant spatial and temporal clues in videos are often underutilized due to the lack of interaction with text. To address these issues, we argue that using texts as guidance to focus on relevant temporal frames and spatial regions within videos is beneficial. In this paper, we propose a novel Syntax-Hierarchy-Enhanced text-video retrieval method (SHE-Net) that exploits the inherent semantic and syntax hierarchy of texts to bridge the modality gap from two perspectives. First, to facilitate a more fine-grained integration of visual content, we employ the text syntax hierarchy, which reveals the grammatical structure of text descriptions, to guide the visual representations. Second, to further enhance the multi-modal interaction and alignment, we also utilize the syntax hierarchy to guide the similarity calculation. We evaluated our method on four public text-video retrieval datasets: MSR-VTT, MSVD, DiDeMo, and ActivityNet. The experimental results and ablation studies confirm the advantages of our proposed method.
Knowledge distillation is a mainstream algorithm in model compression that transfers knowledge from a larger model (teacher) to a smaller model (student) to improve the performance of the student. Despite many efforts, existing methods mainly investigate the consistency of instance-level feature representations or predictions, which neglects the category-level information and the difficulty of each sample, leading to undesirable performance. To address these issues, we propose a novel preview-based category contrastive learning method for knowledge distillation (PCKD). It first distills the structural knowledge of both instance-level feature correspondence and the relation between instance features and category centers in a contrastive learning fashion, which can explicitly optimize the category representation and explore the distinct correlation between representations of instances and categories, contributing to discriminative category centers and better classification results. Besides, we introduce a novel preview strategy to dynamically determine how much the student should learn from each sample according to its difficulty. Different from existing methods that treat all samples equally and curriculum learning that simply filters out hard samples, our method assigns a small weight to hard instances as a preview to better guide the student training. Extensive experiments on several challenging datasets, including CIFAR-100, ImageNet and Pascal VOC, demonstrate its superiority over state-of-the-art methods.
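The preview strategy above weights each sample by its difficulty instead of filtering hard ones out. As a loose illustration only (the difficulty measure, weighting rule, and names are assumptions, not PCKD's exact formulas), one could weight a distillation loss by the teacher's confidence in the ground-truth class:

```python
import torch
import torch.nn.functional as F

def preview_weights(teacher_logits, targets, temperature=4.0):
    """Illustrative rule: difficulty ~ 1 - teacher probability of the gold class;
    easy samples get a weight close to 1, hard samples a small non-zero weight."""
    with torch.no_grad():
        teacher_prob = F.softmax(teacher_logits / temperature, dim=1)
        gt_conf = teacher_prob.gather(1, targets.unsqueeze(1)).squeeze(1)
    return gt_conf.clamp(min=0.1)   # "preview" hard samples rather than drop them

def weighted_kd_loss(student_logits, teacher_logits, targets, T=4.0):
    """Cross-entropy + KL distillation term, both rescaled per sample."""
    w = preview_weights(teacher_logits, targets, T)
    ce = F.cross_entropy(student_logits, targets, reduction="none")
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="none",
    ).sum(dim=1) * (T * T)
    return (w * (ce + kd)).mean()

# Toy usage.
s = torch.randn(8, 100, requires_grad=True)
t = torch.randn(8, 100)
y = torch.randint(0, 100, (8,))
loss = weighted_kd_loss(s, t, y)
loss.backward()
print(loss.item())
```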
A neural Multimodal Machine Translation (MMT) system utilizes multimodal information, particularly images, to enhance traditional text-only models and achieve superior performance. However, the effectiveness of MMT heavily depends on the availability of extensive collections of bilingual parallel sentence pairs and manually annotated images, which poses a challenge due to the scarcity of such pairs. To address this issue, we propose incorporating a Multimodal Knowledge Graph (MMKG) for data augmentation in MMT. By utilizing the MMKG as an additional source of knowledge, we can overcome the limitations of existing sentence-image pairings. This allows us to expand the original parallel corpus and generate corresponding images, creating new synthetic data pairs that facilitate effective data augmentation. The proposed multimodal data augmentation method demonstrates significant improvements in BLEU scores on two distinct datasets. Specifically, on the IKEA dataset, the method achieves a maximum improvement of 22.00 BLEU points, while on the low-resource Multi30k dataset it shows a maximum improvement of 6.09 BLEU points. These results indicate the efficacy and potential of the proposed method for enhancing the performance of multimodal models across diverse datasets.
Large-scale image and video datasets are a core driver of progress in computer vision algorithms, yet building such datasets for computer vision tasks is an important but complex undertaking. Data generation methods based on generative adversarial networks, diffusion models, and related techniques can controllably generate large-scale, diverse image and video data, effectively substituting for or complementing real datasets and providing new momentum for the development of computer vision. After introducing the background of image and video data generation and its applications for computer vision, this survey first reviews representative data generation techniques and models from three perspectives: traditional data augmentation and generation represented by geometric transformations; 3D-rendering-based generation represented by virtual engines and neural radiance fields; and deep generative models represented by generative adversarial networks and diffusion models. It then summarizes their applications in typical computer vision tasks, including image enhancement; individual-level analysis such as object detection and tracking and pose/action recognition; group-level analysis such as image- and video-based biometric recognition, crowd counting, and crowd behavior analysis; autonomous driving; video generation; and embodied intelligence. Finally, it analyzes open problems in data generation and its applications for computer vision and outlines future trends, with the aim of advancing both image and video data generation and computer vision techniques.
Social video platforms have emerged as significant channels for information dissemination, facilitating lively public discussions that often give rise to controversies. However, existing approaches to controversy detection primarily focus on textual features, which raises three key concerns: they underutilize the potential of visual information available on social media platforms; they are ineffective when faced with incomplete or absent textual information; and existing datasets fail to adequately address the need for comprehensive multimodal resources on social media platforms. To address these challenges, we construct a large-scale Multimodal Controversial Dataset (MMCD) in Chinese. Additionally, we propose a novel framework named Multi-view Controversy Detection (MVCD) to effectively model controversies from multiple perspectives. Through extensive experiments using state-of-the-art models on the MMCD, we demonstrate MVCD's effectiveness and potential impact.
Motivated by the superior performance of image diffusion models, more and more researchers strive to extend these models to the text-based video editing task. Nevertheless, current video editing methods mainly suffer from the dilemma between high fine-tuning cost and limited generation capacity. Compared with images, we conjecture that videos necessitate more constraints to preserve temporal consistency during editing. Towards this end, we propose EVE, a robust and Efficient zero-shot Video Editing method. Under the guidance of depth maps and temporal consistency constraints, EVE derives satisfactory video editing results at an affordable computational and time cost. Moreover, recognizing the absence of a publicly available video editing dataset for fair comparisons, we construct a new benchmark named the ZVE-50 dataset. Through comprehensive experimentation, we validate that EVE achieves a satisfactory trade-off between performance and efficiency. Codebase, datasets, and video editing demos are available at https://github.com/alipay/Ant-Multi-Modal-Framework/blob/main/prj/EVE.
We present a framework for learning cross-modal video representations by directly pre-training on raw data to facilitate various downstream video-text tasks. Our main contributions lie in the pre-training framework and proxy tasks. First, based on the shortcomings of two mainstream pixel-level pre-training architectures (limited applicability or low efficiency), we propose Shared Network Pre-training (SNP). By employing one shared BERT-type network to refine textual and cross-modal features simultaneously, SNP is lightweight and can support various downstream applications. Second, based on the intuition that people always pay attention to several “significant words” when understanding a sentence, we propose the Significant Semantic Strengthening (S3) strategy, which includes a novel masking and matching proxy task to promote the pre-training performance. Experiments conducted on three downstream video-text tasks and six datasets demonstrate that we establish a new state-of-the-art in pixel-level video-text pre-training and achieve a satisfactory balance between pre-training efficiency and fine-tuning performance. The codebase and pre-trained models are available at https://github.com/dongxingning/SNPS3.
This paper aims to tackle a novel task - Temporal Sentence Grounding in Streaming Videos (TSGSV). The goal of TSGSV is to evaluate the relevance between a video stream and a given sentence query. Unlike regular videos, streaming videos are acquired continuously from a particular source and often need to be processed on the fly in many applications such as surveillance and live-stream analysis. Thus, TSGSV is challenging since it requires the model to infer without future frames and to process long historical frames effectively, which earlier methods do not address. To specifically address the above challenges, we propose two novel methods: (1) a TwinNet structure that enables the model to learn about upcoming events; and (2) a language-guided feature compressor that eliminates redundant visual frames and reinforces the frames that are relevant to the query. We conduct extensive experiments using the ActivityNet Captions, TACoS, and MAD datasets. The results demonstrate the superiority of our proposed methods. A systematic ablation study also confirms their effectiveness.
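The language-guided feature compressor above keeps query-relevant frames and drops redundant ones. The following is a simplified stand-in (cosine similarity plus top-k selection), not the paper's actual module; all names and parameters are illustrative.

```python
import torch
import torch.nn.functional as F

def compress_frames(frame_feats, query_feat, keep_ratio=0.25):
    """Keep only the historical frames most similar to the sentence query.

    frame_feats: (num_frames, dim) visual features of historical frames
    query_feat:  (dim,)            sentence-query embedding
    """
    sims = F.cosine_similarity(frame_feats, query_feat.unsqueeze(0), dim=1)
    k = max(1, int(keep_ratio * frame_feats.size(0)))
    _, top_idx = sims.topk(k)
    top_idx, _ = top_idx.sort()          # restore temporal order of kept frames
    return frame_feats[top_idx], top_idx

# Toy usage: 128 historical frames with 256-d features.
frames = torch.randn(128, 256)
query = torch.randn(256)
kept, idx = compress_frames(frames, query)
print(kept.shape, idx[:5])
```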
Recent advancements in video-language understanding have been established on the foundation of image-text models, resulting in promising outcomes due to the shared knowledge between images and videos. However, video-language understanding presents unique challenges due to the inclusion of highly complex semantic details, which result in information redundancy, temporal dependency, and scene complexity. Current techniques have only partially tackled these issues, and our quantitative analysis indicates that some of these methods are complementary. In light of this, we propose a novel framework called RTQ (Refine, Temporal model, and Query), which addresses these challenges simultaneously. The approach involves refining redundant information within frames, modeling temporal relations among frames, and querying task-specific information from the videos. Remarkably, our model demonstrates outstanding performance even in the absence of video-language pre-training, and the results are comparable with or superior to those achieved by state-of-the-art pre-training methods.
Owing to well-designed large-scale video-text datasets, recent years have witnessed tremendous progress in video-text pre-training. However, existing large-scale video-text datasets are mostly English-only. Although some methods study Chinese video-text pre-training, they pre-train their models on private datasets whose videos and text are unavailable. This lack of large-scale public datasets and benchmarks in Chinese hampers the research and downstream applications of Chinese video-text pre-training. Towards this end, we release and benchmark CNVid-3.5M, a large-scale public cross-modal dataset containing over 3.5M Chinese video-text pairs. We summarize our contributions by three verbs, i.e., “Build”, “Filter”, and “Pretrain”: 1) To build a public Chinese video-text dataset, we collect over 4.5M videos from Chinese websites. 2) To improve the data quality, we propose a novel method to filter out 1M weakly-paired videos, resulting in the CNVid-3.5M dataset. And 3) we benchmark CNVid-3.5M with three mainstream pixel-level pre-training architectures. Finally, we propose a Hard Sample Curriculum Learning strategy to promote the pre-training performance. To the best of our knowledge, CNVid-3.5M is the largest public video-text dataset in Chinese, and we provide the first pixel-level benchmarks for Chinese video-text pre-training. The dataset, codebase, and pre-trained models are available at https://github.com/CNVid/CNVid-3.5M.
The recently proposed FixMatch and FlexMatch have achieved remarkable results in the field of semi-supervised learning, but they go to two extremes: FixMatch uses a pre-defined constant threshold for all classes, whereas FlexMatch uses an adaptive threshold for each category. By only investigating consistency regularization, they also suffer from unstable results and indiscriminative feature representations, especially when labeled samples are scarce. In this paper, we propose a novel CHMatch method, which can learn robust adaptive thresholds for instance-level prediction matching as well as discriminative features by contrastive hierarchical matching. We first present a memory-bank based robust threshold learning strategy to select highly-confident samples. In the meantime, we make full use of the structured information in the hierarchical labels to learn an accurate affinity graph for contrastive learning. CHMatch achieves very stable and superior results on several commonly-used benchmarks. For example, CHMatch achieves 8.44% and 9.02% error rate reduction over FlexMatch on CIFAR-100 under WRN-28-2 with only 4 and 25 labeled samples per class, respectively. Project address: https://github.com/sailist/CHMatch.
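The memory-bank based threshold learning above selects highly-confident samples with a threshold that adapts to recent predictions. A rough sketch of that idea follows; the rolling buffer and the quantile rule are assumptions for illustration, not CHMatch's exact strategy.

```python
from collections import deque

import torch

class ConfidenceMemoryBank:
    """Keep a rolling window of recent max-probabilities and derive an
    adaptive pseudo-labeling threshold from them (illustrative: a fixed quantile)."""

    def __init__(self, size=4096, quantile=0.8):
        self.buffer = deque(maxlen=size)
        self.quantile = quantile

    def update(self, probs):
        # probs: (batch, num_classes) softmax outputs on weakly-augmented samples
        conf, _ = probs.max(dim=1)
        self.buffer.extend(conf.detach().cpu().tolist())

    def threshold(self, default=0.95):
        if len(self.buffer) < 100:       # fall back until the bank fills up
            return default
        return float(torch.tensor(list(self.buffer)).quantile(self.quantile))

    def select(self, probs):
        """Boolean mask of highly-confident samples and their pseudo-labels."""
        conf, pseudo = probs.max(dim=1)
        return conf >= self.threshold(), pseudo

# Toy usage.
bank = ConfidenceMemoryBank()
probs = torch.softmax(torch.randn(32, 10), dim=1)
bank.update(probs)
mask, pseudo = bank.select(probs)
print(mask.sum().item(), pseudo[:5])
```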
The multimodal emotion recognition in conversation task aims to predict the emotion label for a given utterance with its context and multiple modalities. Existing approaches achieve good results but suffer from the following two limitations: 1) they lack modeling of diverse dependency ranges, i.e., long, short, and independent context-specific representations, and do not consider the different recognition difficulty of each utterance; 2) they treat the contributions of the various modalities uniformly. To address the above challenges, we propose the Self-adaptive Context and Modal-interaction Modeling (SCMM) framework. We first design the context representation module, which consists of three submodules to model multiple contextual representations. Thereafter, we propose the modal-interaction module, including three interaction submodules to make full use of each modality. Finally, we design a self-adaptive path selection module to select an appropriate path in each module and integrate the features to obtain the final representation. Extensive experiments under four settings on three multimodal datasets, including IEMOCAP, MELD, and MOSEI, demonstrate that our proposed method outperforms the state-of-the-art approaches.
Semi-supervised learning has been well established in the area of image classification but remains underexplored in video-based action recognition. FixMatch is a state-of-the-art semi-supervised method for image classification, but it does not work well when transferred directly to the video domain since it only utilizes the single RGB modality, which contains insufficient motion information. Moreover, it only leverages highly-confident pseudo-labels to explore consistency between strongly-augmented and weakly-augmented samples, resulting in limited supervised signals, long training time, and insufficient feature discriminability. To address the above issues, we propose neighbor-guided consistent and contrastive learning (NCCL), which takes both RGB and temporal gradient (TG) as input and is based on the teacher-student framework. Due to the limited labelled samples, we first incorporate neighbor information as a self-supervised signal to explore the consistency property, which compensates for the lack of supervised signals and FixMatch's long training time. To learn more discriminative feature representations, we further propose a novel neighbor-guided category-level contrastive learning term to minimize the intra-class distance and enlarge the inter-class distance. We conduct extensive experiments on four datasets to validate its effectiveness. Compared with the state-of-the-art methods, our proposed NCCL achieves superior performance with much lower computational cost.
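The category-level contrastive idea above (minimizing intra-class distance and enlarging inter-class distance) can be written down in a generic form. The class-center formulation below is an illustrative stand-in, not NCCL's exact term.

```python
import torch
import torch.nn.functional as F

def category_contrastive(features, labels, centers, temperature=0.1):
    """Pull each (normalized) feature toward its class center and away from
    the other centers via a softmax over feature-center similarities."""
    features = F.normalize(features, dim=1)
    centers = F.normalize(centers, dim=1)
    logits = features @ centers.t() / temperature    # (batch, num_classes)
    return F.cross_entropy(logits, labels)

# Toy usage: 32 clip features, 10 action classes with learnable centers.
feats = torch.randn(32, 128, requires_grad=True)
labels = torch.randint(0, 10, (32,))
centers = torch.randn(10, 128, requires_grad=True)
loss = category_contrastive(feats, labels, centers)
loss.backward()
print(loss.item())
```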
The last decade has witnessed the proliferation of micro-videos on various user-generated content platforms. According to our statistics, around 85.7% of micro-videos lack annotation. In this paper, we focus on annotating micro-videos with tags. Existing methods mostly focus on analyzing video content, neglecting users' social influence and tag relations. Meanwhile, existing tag relation construction methods suffer from either deficient performance or low tag coverage. To jointly model social influence and tag relations, we formulate micro-video tagging as a link prediction problem in a constructed heterogeneous network. Specifically, the tag relation (represented by a tag ontology) is constructed in a semi-supervised manner. Then, we combine the tag relation, video-tag annotations, and user follow relations to build the network. Afterward, better video and tag representations are derived through Behavior Spread modeling and visual and linguistic knowledge aggregation. Finally, the semantic similarity between each micro-video and all candidate tags is calculated in this video-tag network. Extensive experiments on industrial datasets of three verticals verify the superiority of our model compared with several state-of-the-art baselines.
Scene Graph Generation (SGG), which generally follows a regular encoder-decoder pipeline, aims to first encode the visual contents within the given image and then parse them into a compact summary graph. Existing SGG approaches generally not only suffer from insufficient modality fusion between vision and language, but also fail to provide informative predicates due to biased relationship predictions, leaving SGG far from practical. Towards this end, we first present a novel Stacked Hybrid-Attention network, which facilitates intra-modal refinement as well as inter-modal interaction, to serve as the encoder. We then devise an innovative Group Collaborative Learning strategy to optimize the decoder. In particular, based on the observation that the recognition capability of a single classifier is limited on an extremely unbalanced dataset, we first deploy a group of classifiers that are expert in distinguishing different subsets of classes, and then cooperatively optimize them from two aspects to promote unbiased SGG. Experiments conducted on the VG and GQA datasets demonstrate that we not only establish a new state-of-the-art in the unbiased metric, but also nearly double the performance compared with two baselines. Our code is available at https://github.com/dongxingning/SHA-GCL-for-SGG.
Influencer marketing is emerging as a new marketing method, profoundly changing the marketing strategies of brands. To help brands find suitable micro-influencers as marketing partners, micro-influencer recommendation is regarded as an indispensable part of influencer marketing. However, previous works only focus on modeling the individual image of brands/micro-influencers, which is insufficient to represent the characteristics of brands/micro-influencers in marketing scenarios. To this end, we propose a micro-influencer ranking joint learning framework which models brands/micro-influencers from the perspectives of individual image, target audiences, and cooperation preferences. Specifically, to model accounts' individual image, we extract topic information and image semantic information from historical content and fuse them to learn the account content representation. We introduce target audiences as a new kind of marketing role in micro-influencer recommendation, in which the audience information of a brand/micro-influencer is leveraged to learn the multi-modal account audience representation. Afterward, we build an attribute co-occurrence graph network to mine cooperation preferences from social media interaction information. Based on account attributes, the cooperation preferences between brands and micro-influencers are refined into attribute co-occurrence information. The attribute node embeddings learned in the attribute co-occurrence graph network are further utilized to construct the account attribute representation. Finally, a global ranking function is designed to generate ranking scores for all brand-micro-influencer pairs from the three perspectives jointly. Extensive experiments on a publicly available dataset demonstrate the effectiveness of our proposed model over the state-of-the-art methods.
Cross-modal retrieval aims to retrieve relevant data from another modality when given a query of one modality. Although most existing methods that rely on the label information of multimedia data have achieved promising results, the performance benefiting from labeled data comes at a high cost since labeling data often requires enormous labor resources, especially on large-scale multimedia datasets. Therefore, unsupervised cross-modal learning is of crucial importance in real-world applications. In this paper, we propose a novel unsupervised cross-modal retrieval method, named Self-supervised Correlation Learning (SCL), which takes full advantage of large amounts of unlabeled data to learn discriminative and modality-invariant representations. Since unsupervised learning lacks the supervision of category labels, we incorporate the knowledge from the input as a supervisory signal by maximizing the mutual information between the input and the output of different modality-specific projectors. Besides, for the purpose of learning discriminative representations, we exploit unsupervised contrastive learning to model the relationship among intra- and inter-modality instances, which makes similar samples closer and pushes dissimilar samples apart. Moreover, to further eliminate the modality gap, we use a weight-sharing scheme and minimize the modality-invariant loss in the joint representation space. Beyond that, we also extend the proposed method to the semi-supervised setting. Extensive experiments conducted on three widely-used benchmark datasets demonstrate that our method achieves competitive results compared with current state-of-the-art cross-modal retrieval approaches.
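The contrastive part of SCL above pulls matched cross-modal pairs together and pushes mismatched ones apart. As a generic illustration of that idea (a standard symmetric InfoNCE objective, not necessarily SCL's exact loss), a minimal PyTorch sketch:

```python
import torch
import torch.nn.functional as F

def cross_modal_infonce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over matched image/text pairs in a batch: the i-th
    image and i-th text are positives; all other pairings act as negatives."""
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    logits = img_emb @ txt_emb.t() / temperature   # (batch, batch) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, labels)
    loss_t2i = F.cross_entropy(logits.t(), labels)
    return 0.5 * (loss_i2t + loss_t2i)

# Toy usage: 16 matched image/text pairs with 128-d projected features.
img = torch.randn(16, 128, requires_grad=True)
txt = torch.randn(16, 128, requires_grad=True)
loss = cross_modal_infonce(img, txt)
loss.backward()
print(loss.item())
```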
Scene Graph Generation (SGG) aims to detect the objects and their pairwise predicates in an image. Existing SGG methods mainly address the challenging predicate prediction task, which involves a severely long-tailed data distribution, with a single classifier. However, we argue that a single classifier may be sufficient to differentiate predicates that present obvious differences (e.g., on and near), but not to distinguish similar predicates that only have subtle differences (e.g., on and standing on). Towards this end, we divide the predicate prediction into a few sub-tasks with a Divide-and-Conquer Predictor (DC-Predictor). Specifically, we first develop an offline pattern-predicate correlation mining algorithm to discover similar predicates that share the same object interaction pattern. Based on that, we devise a general pattern classifier and a set of specific predicate classifiers for DC-Predictor. The former recognizes the pattern of a given object pair and routes it to the corresponding specific predicate classifier, while the latter differentiates similar predicates within each specific pattern. In addition, we introduce the Bayesian Personalized Ranking loss in each specific predicate classifier to enhance the pairwise differentiation between head predicates and their similar ones. Experiments on the VG150 and GQA datasets show the superiority of our model over state-of-the-art methods.
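The routing structure of the DC-Predictor (a general pattern classifier dispatching each object pair to a specific predicate classifier) can be sketched as follows; the module names, sizes, and hard argmax routing are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DCPredictorSketch(nn.Module):
    """Route each object-pair feature to a pattern-specific predicate classifier.

    Structural sketch only: the pattern-predicate grouping is mined offline in
    the paper; here it is simply passed in as a list of predicate-id groups."""

    def __init__(self, feat_dim, predicate_groups):
        super().__init__()
        self.pattern_clf = nn.Linear(feat_dim, len(predicate_groups))
        self.predicate_clfs = nn.ModuleList(
            [nn.Linear(feat_dim, len(group)) for group in predicate_groups]
        )
        self.groups = predicate_groups   # e.g. [[ids of 'on', 'standing on'], ...]

    def forward(self, pair_feats):
        # 1) the general pattern classifier picks a pattern for each pair
        pattern = self.pattern_clf(pair_feats).argmax(dim=1)
        # 2) each pair is routed to the matching specific predicate classifier
        scores = [self.predicate_clfs[int(p)](feat) for feat, p in zip(pair_feats, pattern)]
        return pattern, scores

# Toy usage: two patterns, each grouping a few similar predicates.
model = DCPredictorSketch(feat_dim=64, predicate_groups=[[0, 1, 2], [3, 4]])
pattern, scores = model(torch.randn(5, 64))
print(pattern.tolist(), scores[0].shape)
```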
With the rapid development of the influencer marketing industry in recent years, cooperation between brands and micro-influencers on marketing has attracted much attention. As a key sub-task of influencer marketing, micro-influencer recommendation is gaining momentum. However, in influencer marketing campaigns, it is not enough to only consider marketing effectiveness. Towards this end, we propose a concept-based micro-influencer ranking framework to address both marketing effectiveness and self-development needs for the task of micro-influencer recommendation. Marketing effectiveness is improved by a concept-based social media account representation and a micro-influencer ranking function. We build social media account representations from the perspectives of historical activities and marketing direction, and two adaptively learned metrics, the endorsement effect score and the micro-influencer influence score, are defined to learn the micro-influencer ranking function. To meet self-development needs, we design a bi-directional concept attention mechanism to focus on brands' and micro-influencers' marketing direction over social media concepts. Interpretable concept-based parameters are utilized to help brands and micro-influencers make marketing decisions. Extensive experiments conducted on a real-world dataset demonstrate the advantage of our proposed method compared with the state-of-the-art methods.
The GWI survey (http://tinyurl.com/zk6kgc9) has highlighted the flourishing use of multiple social networks: the average number of social media accounts per Internet user is 5.54, and among them, 2.82 are being used actively. Indeed, users tend to express their views on more than one social media site. Hence, merging social signals of the same user across different social networks, if available, can facilitate downstream analyses. Previous work has paid little attention to modeling the cooperation among the following factors when fusing data from multiple social networks: 1) since data from different sources characterize the same social user, source consistency merits attention; 2) due to their different functional emphases, some aspects of the same user captured by different social networks can be complementary, resulting in source complementarity; and 3) different sources can contribute differently to the user characterization, leading to different source confidence. Toward this end, we propose a novel unified model, which co-regularizes source consistency, complementarity, and confidence to boost the learning performance with multiple social networks. In addition, we derived its theoretical solution and verified the model with the real-world application of user interest inference. Extensive experiments against several state-of-the-art competitors have justified the superiority of our model.
With rising awareness of environmental protection and recycling, second-hand trading platforms have attracted increasing attention in recent years. The interaction data on second-hand trading platforms, consisting of sufficient interactions per user but rare interactions per item, differs from that on traditional platforms. Therefore, building successful recommendation systems for second-hand trading platforms requires balancing the modeling of items' and users' preferences and mitigating the adverse effects of sparsity, which makes recommendation especially challenging. Accordingly, we propose a method to simultaneously learn representations of items and users from coarse-grained and fine-grained features, and a multi-task learning strategy is designed to address the issue of data sparsity. Experiments conducted on a real-world second-hand trading platform dataset demonstrate the effectiveness of our proposed model.
Recommending new items in real-world e-commerce portals is challenging due to the cold-start phenomenon, i.e., the lack of user-item interactions. To address this problem, we propose a novel recommendation model, an adversarial neural network with multiple generators, to generate users from multiple perspectives of items' attributes. Namely, the generated users are represented by attribute-level features. As both users and items have attribute-level representations, we can implicitly obtain user-item attribute-level interaction information. In light of this, a new item can be recommended to users based on attribute-level similarity. Extensive experimental results on two item cold-start scenarios, movie and goods recommendation, verify the effectiveness of our proposed model compared to state-of-the-art baselines.
The hashtags a user provides for a micro-video are the ones that, in his/her mind, best describe the semantics of the micro-video's content. At the same time, hashtags have been widely used to facilitate various micro-video retrieval scenarios (e.g., search, browsing, and categorization). Despite their importance, numerous micro-videos lack hashtags or contain inaccurate or incomplete hashtags. In light of this, hashtag recommendation, which suggests a list of hashtags to a user when he/she wants to annotate a post, becomes a crucial research problem. However, little attention has been paid to micro-video hashtag recommendation, mainly for the following three reasons: 1) the lack of a benchmark dataset; 2) the temporal and multi-modality characteristics of micro-videos; and 3) hashtag sparsity and long-tail distributions. In this paper, we recommend hashtags for micro-videos by presenting a novel multi-view representation interactive embedding model with graph-based information propagation. It is capable of boosting the performance of micro-video hashtag recommendation by jointly considering sequential feature learning, the video-user-hashtag interaction, and hashtag correlations. Extensive experiments on a constructed dataset demonstrate that our proposed method outperforms state-of-the-art baselines. As a side research contribution, we have released our dataset and codes to facilitate research in this community.
What made you want to wear the clothes you are wearing? Where is the place you want to visit for your upcoming holiday? Why do you like the music you frequently listen to? If you are like most people, you probably made these decisions as a result of watching influencers on social media. Influencer marketing is an opportunity for brands to take advantage of social media using a well-defined and well-designed social media marketing strategy. However, choosing the right influencers is not an easy task. With more people gaining an increasing number of followers on social media, finding the right influencer for an e-commerce company becomes paramount; in fact, most marketers cite it as a top challenge for their brands. To address the aforementioned issues, we propose a data-driven micro-influencer ranking scheme to answer the essential question of finding the right micro-influencer. Specifically, we represent brands and influencers by fusing the visual and textual information of their historical posts. A novel k-buckets sampling strategy and a modified listwise learning-to-rank model are proposed to learn a brand-micro-influencer scoring function. In addition, we develop a new Instagram brand micro-influencer dataset, consisting of 360 brands and 3,748 micro-influencers, which can benefit future researchers in this area. Extensive evaluations demonstrate the advantage of our proposed method compared with the state-of-the-art methods.
Presentations have been an effective method for delivering information to an audience for many years. Over the past few decades, technological advancements have revolutionized the way humans deliver presentations. Conventionally, the quality of a presentation is usually evaluated through painstaking manual analysis by experts. Although expert feedback is effective in helping users improve their presentation skills, manual evaluation suffers from high cost and is often not available to most individuals. In this work, we propose a novel multi-sensor self-quantification system for presentations, designed based on a newly proposed assessment rubric. We present our analytics model with conventional ambient sensors (i.e., static cameras and a Kinect sensor) and the emerging wearable egocentric sensors (i.e., Google Glass). In addition, we performed a cross-correlation analysis of the speaker's vocal behavior and body language. The proposed framework is evaluated on a new presentation dataset, namely, the NUS Multi-Sensor Presentation dataset, which consists of 51 presentations covering a diverse range of topics. To validate the efficacy of the proposed system, we conducted a series of user studies with the speakers and an interview with an English communication expert, which revealed positive and promising feedback.
Text classification is one of the fundamental tasks in natural language processing. Recently, deep neural networks have achieved promising performance in the text classification task compared to shallow models. Despite the significance of deep models, they ignore fine-grained classification clues (matching signals between words and classes), since their classifications mainly rely on text-level representations. To address this problem, we introduce the interaction mechanism to incorporate word-level matching signals into the text classification task. In particular, we design a novel framework, the EXplicit interAction Model (dubbed EXAM), equipped with the interaction mechanism. We evaluated the proposed approach on several benchmark datasets including both multi-label and multi-class text classification tasks. Extensive experimental results demonstrate the superiority of the proposed method. As a byproduct, we have released the codes and parameter settings to facilitate further research.
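The interaction mechanism described above matches word-level signals against classes. A minimal sketch of that idea follows; the dimensions, mean aggregation, and class-embedding parameterization are illustrative assumptions, not EXAM's exact design.

```python
import torch
import torch.nn as nn

class WordClassInteraction(nn.Module):
    """Score each word against each class and aggregate the word-level matching
    signals into document-level class logits (illustrative aggregation: mean
    over non-padding words)."""

    def __init__(self, emb_dim, num_classes):
        super().__init__()
        self.class_emb = nn.Parameter(torch.randn(num_classes, emb_dim) * 0.02)

    def forward(self, word_feats, mask):
        # word_feats: (batch, seq_len, emb_dim); mask: (batch, seq_len), 1 = real token
        interaction = word_feats @ self.class_emb.t()          # (batch, seq, classes)
        mask = mask.unsqueeze(-1).float()
        logits = (interaction * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return logits

# Toy usage.
layer = WordClassInteraction(emb_dim=300, num_classes=5)
words = torch.randn(2, 10, 300)
mask = torch.ones(2, 10)
print(layer(words, mask).shape)   # torch.Size([2, 5])
```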
Social working memory (SWM) plays an important role in navigating social interactions. Inspired by studies in psychology, neuroscience, cognitive science, and machine learning, we propose a probabilistic model of SWM to mimic human social intelligence for personal information retrieval (IR) in social interactions. First, we establish a semantic hierarchy as social long-term memory to encode personal information. Next, we propose a semantic Bayesian network as the SWM, which integrates the cognitive functions of accessibility and self-regulation. One subgraphical model implements the accessibility function to learn the social consensus about IR based on social information concepts, clustering, social context, and similarity between persons. Beyond accessibility, one more layer is added to simulate the function of self-regulation, performing personal adaptation to the consensus based on human personality. Two learning algorithms are proposed to train the probabilistic SWM model on a raw dataset of high uncertainty and incompleteness: one is an efficient learning algorithm based on Newton's method, and the other is a genetic algorithm. Systematic evaluations show that the proposed SWM model is able to learn human social intelligence effectively and outperforms the baseline Bayesian cognitive model. Toward real-world applications, we implement our model on Google Glass as a wearable assistant for social interaction.
The information from social sensors (like Facebook, Twitter, and Instagram) is typically in the form of multimedia (text, image, video, etc.), so coping with such information and mining useful knowledge from it is an increasingly difficult task. In this paper, we first crawl social video data (text and video) from the social sensor Twitter for a specific social event. Then textual, acoustic, and visual features are extracted from these data. Finally, we classify the social videos into subjective and objective videos by merging the affective information from these different modalities. Preliminary experiments show that our proposed method is able to accurately classify the subjectivity of social sensor data.
A wearable system is designed to provide directional information to visually impaired people. It consists of a mobile phone and haptic shoes. The former serves as the perceptual and control unit that generates directional instructions. Upon receiving the instructions, the shoes combine them with the user's walking status to produce unique vibration patterns. To enable effective direction sensing, a few alternative configurations are proposed, whereby the position and strength of vibrations are modulated programmatically. The prototype system is evaluated in a usability test with 60 subjects. By comparing different configurations, it is shown that the system achieves varying levels of perception accuracy under various walking conditions, and that the proposed design is advantageous to the benchmarking configuration. The system has great potential to provide smart and sensible navigation guidance to visually impaired people especially when integrated with visual processing units.
Wearable vision brings about new opportunities for augmenting humans in social interactions. However, along with it comes privacy concerns and possible information overload. We explore users' needs and attitudes toward augmented interaction in face-to-face communications. In particular, we want to find out whether users need additional information when interacting with acquaintances, what information they want to access, and how they use it. Based on observations of user behaviors in interactions assisted by Google Glass, we find that users in general appreciated the usefulness of wearable assistance for social interactions. We highlight a few key issues of how wearable devices affect user experience in social interaction.
Presentations have been an effective means of delivering information to groups for ages. Over the past few decades, technological advancements have revolutionized the way humans deliver presentations. Despite that, the quality of presentations can be varied and affected by a variety of reasons. Conventional presentation evaluation usually requires painstaking manual analysis by experts. Although the expert feedback can definitely assist users in improving their presentation skills, manual evaluation suffers from high cost and is often not accessible to most people. In this work, we propose a novel multi-sensor self-quantification framework for presentations. Utilizing conventional ambient sensors (i.e., static cameras, Kinect sensor) and the emerging wearable egocentric sensors (i.e., Google Glass), we first analyze the efficacy of each type of sensor with various nonverbal assessment rubrics, which is followed by our proposed multi-sensor presentation analytics framework. The proposed framework is evaluated on a new presentation dataset, namely NUS Multi-Sensor Presentation (NUSMSP) dataset, which consists of 51 presentations covering a diverse set of topics. The dataset was recorded with ambient static cameras, Kinect sensor, and Google Glass. In addition to multi-sensor analytics, we have conducted a user study with the speakers to verify the effectiveness of our system generated analytics, which has received positive and promising feedback.
In a typical multi-person social interaction, spatial information plays an important role in analyzing the structure of the social interaction. Previous studies, which analyze spatial structure of the social interaction using one or more third-person view cameras, suffer from the occlusion problem. With the increasing popularity of wearable computing devices, we are now able to obtain natural first-person observations with limited occlusion. However, such observations have a limited field of view, and can only capture a portion of the social interaction. To overcome the aforementioned limitation, we propose a search-based structure recovery method in a small group conversational social interaction scenario to reconstruct the social interaction structure from multiple first-person views, where each of them contributes to the multifaceted understanding of the social interaction. We first map each first-person view to a local coordinate system, then a set of constraints and spatial relationships are extracted from these local coordinate systems. Finally, the human spatial configuration is searched under the constraints to "best" match the extracted relationships. The proposed method is much simpler than full 3D reconstruction, and suffices for capturing the social interaction spatial structure. Experiments for both simulated and real-world data show the efficacy of the proposed method.
In the context of a social gathering, such as a cocktail party, the memorable moments are generally captured by professional photographers or by the participants. The latter case is often undesirable because many participants would rather enjoy the event instead of being occupied by the photo-taking task. Motivated by this scenario, we propose the use of a set of cameras to automatically take photos. Instead of performing dense analysis on all cameras for photo capturing, we first detect the occurrence and location of social interactions via F-formation detection. In the sociology literature, F-formation is a concept used to define social interactions, where each detection only requires the spatial location and orientation of each participant. This information can be robustly obtained with additional Kinect depth sensors. In this paper, we propose an extended F-formation system for robust detection of interactions and interactants. The extended F-formation system employs a heat-map based feature representation for each individual, namely Interaction Space (IS), to model their location, orientation, and temporal information. Using the temporally encoded IS of each detected interactant, we propose a best-view camera selection framework to detect the corresponding best-view camera for each detected social interaction. The extended F-formation system is evaluated with synthetic data on multiple scenarios. To demonstrate the effectiveness of the proposed system, we conducted a user study to compare our best-view camera ranking with humans' rankings using real-world data.
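The Interaction Space (IS) representation above encodes each participant's location and orientation as a heat map. The NumPy snippet below is only a rough stand-in for that idea: it places a Gaussian a fixed distance ahead of each person along their facing direction, and overlapping heat maps hint at a shared interaction space. The placement rule and parameters are assumptions for illustration.

```python
import numpy as np

def interaction_space(pos, theta, grid=64, extent=4.0, reach=1.5, spread=0.6):
    """Heat map of the region a person faces: a Gaussian centred `reach`
    metres ahead of position `pos` along orientation `theta` (radians)."""
    xs = np.linspace(-extent, extent, grid)
    X, Y = np.meshgrid(xs, xs)
    cx = pos[0] + reach * np.cos(theta)
    cy = pos[1] + reach * np.sin(theta)
    return np.exp(-((X - cx) ** 2 + (Y - cy) ** 2) / (2 * spread ** 2))

# Two people facing each other: strongly overlapping interaction spaces.
a = interaction_space(pos=(-1.0, 0.0), theta=0.0)
b = interaction_space(pos=(1.0, 0.0), theta=np.pi)
print(round(float((a * b).sum()), 2))   # overlap score as a crude grouping cue
```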
In the context of a social gathering, such as a cocktail party, the memorable moments are often captured by professional photographers or the participants. The latter case is generally undesirable because many participants would rather enjoy the event instead of being occupied by the tedious photo capturing task. Motivated by this scenario, we propose an automated social event photo-capture framework which, given multiple sensor data streams and information from the Web as input, outputs visually appealing photos of the social event. Our proposal consists of three components: (1) social attribute extraction from both the physical space and the cyberspace; (2) social attribute fusion; and (3) active camera control. Current work is presented, and we conclude with expected contributions as well as future directions.
I would be happy to talk to you if you need my assistance in your research. My Office is at N3-422-2, No. 72 Binhai Road, Qingdao, Shandong, China 266237. (山东省青岛市即墨滨海路72号山东大学青岛校区,266237)