Self-supervised Vision Transformers for 3D pose estimation of novel objects

Please use this identifier to cite or link to this item: http://hdl.handle.net/10045/137301
Item information
Title: Self-supervised Vision Transformers for 3D pose estimation of novel objects
Authors: Thalhammer, Stefan | Weibel, Jean-Baptiste | Vincze, Markus | Garcia-Rodriguez, Jose
Research group(s): Arquitecturas Inteligentes Aplicadas (AIA)
Center, Department, or Service: Universidad de Alicante. Departamento de Tecnología Informática y Computación
Keywords: Object pose estimation | Template matching | Vision transformer | Self-supervised learning
Publication date: 12-Sep-2023
Publisher: Elsevier
Bibliographic citation: Image and Vision Computing. 2023, 139: 104816. https://doi.org/10.1016/j.imavis.2023.104816
Abstract: Object pose estimation is important for object manipulation and scene understanding. In order to improve the general applicability of pose estimators, recent research focuses on providing estimates for novel objects, that is, objects unseen during training. Such works use deep template matching strategies to retrieve the closest template to a query image, which implicitly provides object class and pose. Despite the recent success and improvements of Vision Transformers over CNNs for many vision tasks, the state of the art uses CNN-based approaches for novel object pose estimation. This work evaluates and demonstrates the differences between self-supervised CNNs and Vision Transformers for deep template matching. In detail, both types of approaches are trained using contrastive learning to match training images against rendered templates of isolated objects. At test time such templates are matched against query images of known and novel objects under challenging settings, such as clutter, occlusion and object symmetries, using masked cosine similarity. The presented results not only demonstrate that Vision Transformers improve matching accuracy over CNNs but also that in some cases pre-trained Vision Transformers do not need fine-tuning to achieve the improvement. Furthermore, we highlight the differences in optimization and network architecture when comparing these two types of networks for deep template matching.
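As a rough illustration of the matching step described in the abstract, the sketch below scores a query image against rendered templates using a masked cosine similarity over per-patch features. This is a minimal sketch under stated assumptions: the per-patch descriptors (e.g. from a self-supervised ViT backbone), the per-template object masks, and the function names masked_cosine_similarity and match_query_to_templates are hypothetical illustrations, not the paper's actual implementation.

```python
# Minimal sketch of masked cosine-similarity template matching.
# Assumes per-patch descriptors have already been extracted for the
# query crop and for each rendered template (e.g. with a ViT backbone),
# and that each template comes with a boolean patch mask of the object.
import torch
import torch.nn.functional as F

def masked_cosine_similarity(query_feats, template_feats, mask):
    """Mean per-patch cosine similarity, restricted to object patches.

    query_feats:    (N, D) patch descriptors of the query crop
    template_feats: (N, D) patch descriptors of one rendered template
    mask:           (N,)   boolean, True where the template shows the object
    """
    q = F.normalize(query_feats[mask], dim=-1)
    t = F.normalize(template_feats[mask], dim=-1)
    # Dot product of unit vectors = cosine similarity; average over valid patches.
    return (q * t).sum(dim=-1).mean()

def match_query_to_templates(query_feats, templates):
    """templates: list of (template_feats, mask, pose); returns the pose of the
    highest-scoring template together with its similarity score."""
    scores = torch.stack([
        masked_cosine_similarity(query_feats, feats, mask)
        for feats, mask, _ in templates
    ])
    best = int(scores.argmax())
    return templates[best][2], scores[best]
```

Because the retrieved template is rendered from a known viewpoint, the returned pose (and the object identity of the template set) is obtained implicitly from the best match, as the abstract describes.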
Sponsor(s): We gratefully acknowledge the support of the EU-program EC Horizon 2020 for Research and Innovation under grant agreement No. 101017089, project TraceBot and the NVIDIA Corporation for supporting this research by providing hardware resources.
URI: http://hdl.handle.net/10045/137301
ISSN: 0262-8856 (Print) | 1872-8138 (Online)
DOI: 10.1016/j.imavis.2023.104816
Language: eng
Type: info:eu-repo/semantics/article
Rights: © 2023 Published by Elsevier B.V.
Peer reviewed: yes
Publisher's version: https://doi.org/10.1016/j.imavis.2023.104816
Appears in collections: INV - AIA - Artículos de Revistas
Investigaciones financiadas por la UE

Files in this item:
File | Description | Size | Format
Thalhammer_etal_2023_ImageVisionComput_final.pdf | Final version (restricted access) | 3.43 MB | Adobe PDF
Thalhammer_etal_2023_ImageVisionComput_preprint.pdf | Preprint (open access) | 992.81 kB | Adobe PDF


All documents in RUA are protected by copyright. Some rights reserved.