AlignPose: Generalizable 6D Pose Estimation via Multi-view Feature-metric Alignment

1 CIIRC, CTU Prague; 2 FEE, CTU Prague
Approach overview.

AlignPose is a multi-view 6D object pose estimation method. It takes as input a set of RGB images captured from multiple viewpoints with known camera poses, together with a set of 3D object models, and estimates the 6D poses of all object instances in the scene. The approach consists of three steps. First, a single-view estimator extracts pose candidates from each viewpoint. Second, these candidates are aggregated into a common coordinate system, where non-maximum suppression (NMS) removes redundant candidates. Finally, a novel multi-view feature-metric refinement makes the resulting poses consistent across all views.
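The aggregation step can be illustrated with a minimal sketch. It assumes poses are stored as 4x4 homogeneous matrices and that duplicates are detected by comparing object translations in the world frame; the suppression criterion, the threshold, and the function names `to_world` and `nms_by_translation` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def to_world(T_world_cam, T_cam_obj):
    """Express a camera-frame object pose in the common world frame."""
    return T_world_cam @ T_cam_obj


def nms_by_translation(poses, scores, labels, dist_thresh):
    """Greedy NMS over world-frame pose candidates.

    Assumption: two candidates are duplicates if they share an object
    label and their translations are closer than dist_thresh.
    Returns the indices of the kept candidates, best score first.
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    keep = []
    for i in order:
        duplicate = any(
            labels[i] == labels[j]
            and np.linalg.norm(poses[i][:3, 3] - poses[j][:3, 3]) < dist_thresh
            for j in keep
        )
        if not duplicate:
            keep.append(int(i))
    return keep


def translation(x, y, z):
    """Helper: pure-translation 4x4 transform (identity rotation)."""
    T = np.eye(4)
    T[:3, 3] = [x, y, z]
    return T


# Two cameras with known extrinsics observe the same object (label 1);
# the first camera also sees a second object (label 2).
T_world_cam1 = np.eye(4)
T_world_cam2 = translation(1.0, 0.0, 0.0)
poses = [
    to_world(T_world_cam1, translation(0.0, 0.0, 1.0)),
    to_world(T_world_cam2, translation(-1.0, 0.0, 1.01)),
    to_world(T_world_cam1, translation(0.5, 0.0, 1.0)),
]
scores = [0.9, 0.8, 0.7]
labels = [1, 1, 2]
kept = nms_by_translation(poses, scores, labels, dist_thresh=0.05)
```

Comparing translations only keeps the sketch independent of object symmetries, where rotations of equivalent poses can differ widely; a real system may use a different similarity measure.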

Abstract

Single-view RGB model-based object pose estimation methods achieve strong generalization but are fundamentally limited by depth ambiguity, clutter, and occlusions. Multi-view pose estimation methods have the potential to solve these issues, but existing works rely on precise single-view pose estimates or lack generalization to unseen objects. We address these challenges via the following three contributions. First, we introduce AlignPose, a 6D object pose estimation method that aggregates information from multiple extrinsically calibrated RGB views and does not require any object-specific training or symmetry annotation. Second, the key component of this approach is a new multi-view feature-metric refinement specifically designed for object pose. It optimizes a single, consistent world-frame object pose by minimizing the feature discrepancy between on-the-fly rendered object features and observed image features across all views simultaneously. Third, we report extensive experiments on four datasets (YCB-V, T-LESS, ITODD-MV, HouseCat6D) using the BOP benchmark evaluation and show that AlignPose outperforms other published methods, especially on challenging industrial datasets where multiple views are readily available in practice.
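One plausible way to write the refinement objective described above is sketched below. All notation is assumed for illustration and is not taken from the paper: $T_{WO}$ is the world-frame object pose, $T_{C_v W}$ the known world-to-camera transform of view $v$, $\mathbf{F}^{\mathrm{rend}}_v$ the feature map rendered from the object model, $\mathbf{F}^{\mathrm{obs}}_v$ the feature map extracted from the observed image, $\Omega_v$ the set of pixels covered by the rendered object in view $v$, and $\rho$ an optional robust loss.

```latex
\hat{T}_{WO}
  = \arg\min_{T_{WO} \in SE(3)}
    \sum_{v=1}^{V} \sum_{\mathbf{u} \in \Omega_v}
    \rho\!\left(
      \left\|
        \mathbf{F}^{\mathrm{rend}}_v\!\left(\mathbf{u};\, T_{C_v W}\, T_{WO}\right)
        - \mathbf{F}^{\mathrm{obs}}_v(\mathbf{u})
      \right\|_2
    \right)
```

Because a single $T_{WO}$ appears in every term of the sum over views, any minimizer is consistent across views by construction, in contrast to refining each view independently and merging the results afterwards.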

Qualitative Results

[Figure: qualitative results. Panels: input image, input pose candidates, our refined poses. Datasets shown: YCB-V, HouseCat6D, T-LESS.]

BibTeX

@misc{AlignPose2025,
      title={AlignPose: Generalizable 6D Pose Estimation via Multi-view Feature-metric Alignment}, 
      author={Anna Šárová Mikeštíková and Médéric Fourmy and Martin Cífka and Josef Sivic and Vladimir Petrik},
      year={2025},
      eprint={2512.20538},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.20538}, 
}

Acknowledgements

This work was supported by the European Union’s Horizon Europe projects AGIMUS (No. 101070165), euROBIN (No. 101070596), ERC FRONTIER (No. 101097822), and ELLIOT (No. 101214398). It was further supported by the Grant Agency of the Czech Technical University in Prague (SGS25/152/OHK3/3T/13). Compute resources and infrastructure were supported by the Ministry of Education, Youth and Sports of the Czech Republic through the e-INFRA CZ (ID:90254) and by the European Union’s Horizon Europe project CLARA (No. 101136607).