UMono: Physical-Model-Informed Hybrid CNN–Transformer Framework for Underwater Monocular Depth Estimation

Wu, X., Wang, J., Wang, J., Rong, S. and He, B. (2025) UMono: Physical-Model-Informed Hybrid CNN–Transformer Framework for Underwater Monocular Depth Estimation. IEEE Journal of Oceanic Engineering. pp. 1-14. https://doi.org/10.1109/JOE.2025.3606045

Abstract

Underwater monocular depth estimation serves as the foundation for tasks such as 3-D reconstruction of underwater scenes. However, due to the water medium and the absorption and scattering of light in water, the underwater environment undergoes a distinctive imaging process, which presents challenges in accurately estimating depth from a single image. The existing methods fail to consider the unique characteristics of underwater environments, leading to inadequate estimation results and limited generalization performance. Furthermore, underwater depth estimation requires extracting and fusing both local and global features, which is not fully explored in existing methods. In this article, an end-to-end learning framework for underwater monocular depth estimation called UMono is presented, which incorporates underwater image formation model characteristics into the network architecture, and effectively utilizes both local and global features of an underwater image. Specifically, UMono consists of an encoder with a hybrid architecture of a convolutional neural network (CNN) and Transformer and a decoder guided by a medium transmission map. First, we develop an underwater deep feature extraction (UDFE) block, which leverages the CNN and Transformer in parallel to achieve comprehensive extraction of both local and global features. These features are effectively integrated via the proposed local–global feature fusion (LGFF) module.
By stacking the UDFE block as the basic unit, we construct a hybrid encoder that generates four-stage hierarchical features. Subsequently, the medium transmission map is incorporated into the network as underwater domain knowledge and, together with the encoded hierarchical features, is fed into the underwater depth information aggregation (UDIA) module, which aggregates depth information from the physical model and the neural network through a proposed cross-attention mechanism. The aggregated features then serve as guiding information for each decoding stage, helping the model achieve comprehensive scene understanding and precise depth estimation. The final estimated depth map is obtained through consecutive upsampling. Experimental results demonstrate that the proposed method is effective for underwater monocular depth estimation and outperforms the existing methods in both quantitative and qualitative analyses.
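
The abstract does not give the formulation of the medium transmission map or the internals of the UDIA cross-attention. As a rough illustration only, the sketch below assumes the widely used simplified relation t = exp(-beta * d) between transmission t and scene depth d, and a plain scaled dot-product cross-attention with no learned projections. The function names, the value of beta, the toy tensor sizes, and the choice of encoder features as queries are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def transmission_from_depth(depth, beta=0.2):
    """Simplified transmission model t = exp(-beta * d).
    beta is an assumed attenuation coefficient, not a value from the paper."""
    return np.exp(-beta * depth)

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, context_feats):
    """Scaled dot-product cross-attention without learned projections.
    query_feats: (N, C) tokens, context_feats: (M, C) tokens."""
    scale = np.sqrt(query_feats.shape[-1])
    weights = softmax(query_feats @ context_feats.T / scale, axis=-1)  # (N, M)
    return weights @ context_feats, weights

rng = np.random.default_rng(0)
depth = rng.uniform(0.5, 10.0, size=(4, 4))    # toy depth map, in metres
t_map = transmission_from_depth(depth)         # deeper pixels get lower transmission

encoder_feats = rng.standard_normal((16, 8))   # 16 tokens, 8 channels (toy sizes)
# Hypothetical embedding: copy each pixel's transmission value across all channels.
t_feats = np.repeat(t_map.reshape(-1, 1), 8, axis=1)

# Assumed roles: encoder features query the transmission-map features.
aggregated, weights = cross_attention(encoder_feats, t_feats)
```

In this toy setup each output token is a convex combination of transmission values, so the physical prior is mixed into every spatial position; a trained model would instead use learned query, key, and value projections.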