UMono: Physical-Model-Informed Hybrid CNN–Transformer Framework for Underwater Monocular Depth Estimation

Wu, Xupeng; Wang, Jian; Wang, Jing; Rong, Shenghui; He, Bo

Wu, X., Wang, J., Wang, J., Rong, S. and He, B. (2026) UMono: Physical-Model-Informed Hybrid CNN–Transformer Framework for Underwater Monocular Depth Estimation. IEEE Journal of Oceanic Engineering, 51 (1). pp. 780-793. https://doi.org/10.1109/JOE.2025.3606045

Article Link: https://doi.org/10.1109/JOE.2025.3606045

Abstract

Underwater monocular depth estimation serves as the foundation for tasks such as 3-D reconstruction of underwater scenes. However, because the water medium absorbs and scatters light, the underwater environment undergoes a distinctive imaging process, which makes it challenging to estimate depth accurately from a single image. Existing methods fail to consider the unique characteristics of underwater environments, leading to inadequate estimation results and limited generalization performance. Furthermore, underwater depth estimation requires extracting and fusing both local and global features, which is not fully explored in existing methods. In this article, an end-to-end learning framework for underwater monocular depth estimation called UMono is presented, which incorporates underwater image formation model characteristics into the network architecture and effectively utilizes both local and global features of an underwater image. Specifically, UMono consists of an encoder with a hybrid architecture of a convolutional neural network (CNN) and Transformer, and a decoder guided by a medium transmission map. First, we develop an underwater deep feature extraction (UDFE) block, which leverages the CNN and Transformer in parallel to achieve comprehensive extraction of both local and global features. These features are effectively integrated via the proposed local–global feature fusion (LGFF) module. By stacking the UDFE block as the basic unit, we construct a hybrid encoder that generates four-stage hierarchical features. Subsequently, the medium transmission map is incorporated into the network as underwater domain knowledge and, together with the encoded hierarchical features, is fed into the underwater depth information aggregation (UDIA) module, which aggregates depth information from the physical model and the neural network through a proposed cross-attention mechanism. The aggregated features then serve as the guiding information for each decoding stage, helping the model achieve comprehensive scene understanding and precise depth estimation. The final estimated depth map is obtained through consecutive upsampling. Experimental results demonstrate that the proposed method is effective for underwater monocular depth estimation and outperforms existing methods in both quantitative and qualitative analyses.
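The abstract does not spell out the underwater image formation model it builds on. As a rough illustration of why a medium transmission map carries depth information, the sketch below uses the widely used simplified model I = J·t + B·(1 − t) with t = exp(−β·d). This is an assumption for illustration, not necessarily the formulation used in UMono; the function names and the attenuation coefficient `beta` are hypothetical.

```python
import math


def transmission(depth_m, beta):
    """Medium transmission under the simplified model: t = exp(-beta * d)."""
    return math.exp(-beta * depth_m)


def depth_from_transmission(t, beta):
    """Invert the simplified model to recover depth: d = -ln(t) / beta."""
    return -math.log(t) / beta


def observed_intensity(scene_radiance, t, background_light):
    """Simplified image formation: I = J * t + B * (1 - t)."""
    return scene_radiance * t + background_light * (1.0 - t)


# Hypothetical attenuation coefficient (per metre), chosen only for illustration.
beta = 0.2
depths = [0.5, 2.0, 8.0]

# Transmission falls monotonically as scene depth grows.
ts = [transmission(d, beta) for d in depths]

# With a known beta, depth can be read back from the transmission map.
recovered = [depth_from_transmission(t, beta) for t in ts]
```

Under this model, transmission is a monotonic function of scene depth, which is one way to see why a transmission map can act as a depth cue for a decoder. In practice β varies with wavelength and water type, so the map is a prior rather than a direct depth measurement.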

Item Type:	Article
Corporate Creators:	Department of Primary Industries, Queensland
Business groups:	Animal Science
Additional Information:	DPI author Jing Wang
Keywords:	Domain knowledge ; hybrid architecture ; medium transmission map ; underwater depth estimation
Subjects:	Aquaculture and Fisheries ; Aquaculture and Fisheries > Fisheries > Fishery technology ; Aquaculture and Fisheries > Fisheries > Fishery oceanography. Hydrologic factors ; Technology > Technology (General)
Live Archive:	26 Nov 2025 01:21
Last Modified:	30 Jan 2026 03:43
