MMCVLab | Resource

Human-Object Interaction Detection via Disentangled Transformer
Weakly Supervised Deep Hyperspherical Quantization for Image Retrieval
RetCCL: Clustering-guided contrastive learning for whole-slide image retrieval
ViT2Hash: Unsupervised Information-Preserving Hashing
Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers
A simple, efficient and scalable contrastive masked autoencoder for learning visual representations
Character Region Attention for Text Spotting
FairMOT: On the Fairness of Detection and Re-Identification in Multiple Object Tracking
AutoFormer: Searching Transformers for Visual Recognition
TriDet: Temporal Action Detection with Relative Boundary Modeling
EPSANet: An Efficient Pyramid Squeeze Attention Block on Convolutional Neural Network
Mining Cross-Person Cues for Body-Part Interactiveness Learning in HOI Detection
Pruning Filters for Efficient ConvNets
MTL-NAS: Task-Agnostic Neural Architecture Search towards General-Purpose Multi-Task Learning
Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere
CompRess: Self-Supervised Learning by Compressing Representations
VideoBERT: A Joint Model for Video and Language Representation Learning
End-to-end Multiple Instance Learning for Whole-Slide Cytopathology of Urothelial Carcinoma
Few-shot Font Style Transfer between Different Languages
Optimizing Network Structure for 3D Human Pose Estimation
RepVGG: Making VGG-style ConvNets Great Again
Out-of-distribution Detection in Classifiers via Generation
Affect2MM: Affective Analysis of Multimedia Content Using Emotion Causality
A Deep Factorization of Style and Structure in Fonts
ContourNet: Taking a Further Step toward Accurate Arbitrary-shaped Scene Text Detection
SPot-the-Difference Self-Supervised Pre-training for Anomaly Detection and Segmentation
Residual Parameter Transfer for Deep Domain Adaptation
Be Your Own Teacher: Improve the Performance of Convolutional Neural Networks via Self Distillation
Improved Knowledge Distillation via Teacher Assistant
Learning Meta Face Recognition in Unseen Domains
One Loss for All: Deep Hashing with a Single Cosine Similarity based Learning Objective
CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification
Training Confidence-Calibrated Classifiers for Detecting Out-of-Distribution Samples
Contrastive Learning for Unpaired Image-to-Image Translation
Training Confidence-Calibrated Classifiers for Detecting Out-of-Distribution Samples
Knowledge Distillation Meets Self-Supervision
X3D: Expanding Architectures for Efficient Video Recognition
FairNAS: Rethinking Evaluation Fairness of Weight Sharing Neural Architecture Search
Multimodal Fusion via Teacher-Student Network for Indoor Action Recognition
SinGAN: Learning a Generative Model from a Single Natural Image
Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images
PDAN: Pyramid Dilated Attention Network for Action Detection
Feature Selection Based Transfer Subspace Learning for Speech Emotion Recognition
Transformer-based unsupervised contrastive learning for histopathological image classification
OPANAS: One-Shot Path Aggregation Network Architecture Search for Object Detection
OCGAN: One-class Novelty Detection Using GANs with Constrained Latent Representations
Local Correlation for Knowledge Distillation
LiftFormer: 3D Human Pose Estimation using attention models
Machine Learning for Enhancing Dementia Screening in Ageing Deaf Signers of British Sign Language
Out-of-Distribution Detection for Long-tailed and Fine-grained Skin Lesion Images
3D Human Pose Estimation in Video with Temporal Convolutions and Semi-Supervised Training
Generalized ODIN: Detecting Out-of-distribution Image without Learning from Out-of-distribution Data
Suppressing Uncertainties for Large-Scale Facial Expression Recognition
GA-DAN: Geometry-Aware Domain Adaptation Network for Scene Text Detection and Recognition
Multi-Scale Networks for 3D Human Pose
GroupViT: Semantic Segmentation Emerges from Text Supervision
Relational Knowledge Distillation
MoQuad: Motion-focused Quadruple Construction for Video Contrastive Learning
Utilizing Patch-level Category Activation Patterns for Multiple Class Novelty Detection
Language Models as Black-Box Optimizers for Vision-Language Models
Long-Short Temporal Contrastive Learning of Video Transformer
Visual Prompt Tuning for Generative Transfer Learning
DLOW: Domain Flow for Adaptation and Generalization
Towards Total Recall in Industrial Anomaly Detection
Revisiting Knowledge Distillation via Label Smoothing Regularization
Deep Supervised Cross-modal Retrieval
BKinD-3D: Self-Supervised 3D Keypoint Discovery from Multi-View Videos
SA-PatchCore: Anomaly Detection in Dataset With Co-Occurrence Relationships Using Self-Attention
Multiple Class Novelty Detection Under Data Distribution Shift
Consistency Regularization for Generative Adversarial Networks
Interact, Embed, and EnlargE (IEEE): Boosting Modality-specific Representations for Multi-Modal Person Re-identification
Neural Architecture Search for Joint Human Parsing and Pose Estimation
Enhanced 3D Human Pose Estimation from Videos by using Attention-Based Neural Network with Dilated Convolutions
Mask Transfiner for High-Quality Instance Segmentation
Neural Architecture Search on ImageNet in Four GPU Hours: A Theoretically Inspired Perspective
MoViNets: Mobile Video Networks for Efficient Video Recognition
Revisiting Knowledge Distillation: An Inheritance and Exploration Framework
Self-Supervised Predictive Convolutional Attentive Block for Anomaly Detection
Spontaneous Facial Micro-Expression Recognition using 3D Spatiotemporal Convolutional Neural Networks
CvT: Introducing Convolutions to Vision Transformers
Refine Myself by Teaching Myself: Feature Refinement via Self-Knowledge Distillation
Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations
Token Merging: Your ViT But Faster
Hide-and-Tell: Learning to Bridge Photo Streams for Visual Storytelling
Multi-class Novelty Detection Using Mix-up Technique
Training data-efficient image transformers & distillation through attention
Knowledge Distillation from Internal Representations
Improve Object Detection with Feature-Based Knowledge Distillation: Towards Accurate And Efficient Detectors
MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition
Bi-Directional Generation for Unsupervised Domain Adaptation
Audio-Visual Weakly Supervised Approach for Apathy Detection in the Elderly
To BERT or Not To BERT: Comparing Speech and Language-based Approaches for Alzheimer's Disease Detection

Multimedia and Computer Vision Laboratory

National Cheng Kung University