Chai, Chengzhang
PDF - Accepted Post-Print Version (2MB)
Restricted to repository staff only until 14 March 2026 due to copyright restrictions. Available under a Creative Commons Attribution Non-Commercial No Derivatives licence.
Abstract
Deep learning-based bridge visual inspection often produces limited outputs that lack the accurate descriptions required for practical assessments. Researchers have explored multimodal approaches to generating damage descriptions, but existing models are prone to hallucination and face challenges in feature-representation sufficiency, attention-mechanism flexibility, and domain-specific knowledge integration. This paper develops an image captioning framework driven by domain knowledge to address these issues. It incorporates a multi-level feature fusion module that adaptively integrates weights from a trained Faster R-CNN (domain knowledge) into a CNN architecture. It also introduces a correlation-aware attention mechanism that dynamically captures interdependencies between image regions and optimises attentional focus during LSTM decoding. Experimental results show that the proposed framework achieves higher BLEU scores and improved image-text alignment, as verified through attention heatmaps. While the framework enhances inspection efficiency and quality, further dataset expansion and broader domain validation are needed to assess its generalisation ability.
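To make the two mechanisms named in the abstract concrete, the PyTorch sketch below shows one plausible reading of a correlation-aware attention step inside an LSTM caption decoder: region features (e.g. pooled Faster R-CNN proposals) first exchange information through a pairwise correlation step, then an additive attention scores them against the decoder state. All class names, dimensions, and the use of `nn.MultiheadAttention` to model inter-region correlation are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CorrelationAwareAttention(nn.Module):
    """Attention over image-region features that first models pairwise
    region correlations, then scores regions against the decoder state.
    Illustrative sketch only; not the paper's exact formulation."""

    def __init__(self, region_dim: int, hidden_dim: int, attn_dim: int):
        super().__init__()
        # Correlation step (assumed): let each region attend to the others
        # so its representation reflects inter-region dependencies.
        self.region_self_attn = nn.MultiheadAttention(
            embed_dim=region_dim, num_heads=4, batch_first=True
        )
        # Additive (Bahdanau-style) scoring against the LSTM hidden state.
        self.w_region = nn.Linear(region_dim, attn_dim)
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, regions: torch.Tensor, hidden: torch.Tensor):
        # regions: (batch, num_regions, region_dim); hidden: (batch, hidden_dim)
        corr_regions, _ = self.region_self_attn(regions, regions, regions)
        scores = self.v(torch.tanh(
            self.w_region(corr_regions) + self.w_hidden(hidden).unsqueeze(1)
        )).squeeze(-1)                          # (batch, num_regions)
        weights = F.softmax(scores, dim=-1)     # per-step attention weights
        context = (weights.unsqueeze(-1) * corr_regions).sum(dim=1)
        return context, weights


class CaptionDecoderStep(nn.Module):
    """One LSTM decoding step conditioned on the attended visual context."""

    def __init__(self, vocab_size: int, embed_dim=256, region_dim=256,
                 hidden_dim=512, attn_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attention = CorrelationAwareAttention(region_dim, hidden_dim, attn_dim)
        self.lstm = nn.LSTMCell(embed_dim + region_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token, regions, state):
        h, c = state
        context, weights = self.attention(regions, h)
        h, c = self.lstm(torch.cat([self.embed(token), context], dim=-1), (h, c))
        return self.out(h), (h, c), weights


if __name__ == "__main__":
    batch, num_regions, vocab = 2, 36, 1000
    regions = torch.randn(batch, num_regions, 256)   # stand-in region features
    decoder = CaptionDecoderStep(vocab)
    state = (torch.zeros(batch, 512), torch.zeros(batch, 512))
    token = torch.zeros(batch, dtype=torch.long)     # hypothetical <start> id
    logits, state, weights = decoder(token, regions, state)
    print(logits.shape, weights.shape)  # (2, 1000) (2, 36)
```

The per-step `weights` tensor is what an attention heatmap would be rendered from: mapping each weight back to its region's bounding box visualises which image areas drove each generated word, the kind of image-text alignment check the abstract describes.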
| Field | Value |
|---|---|
| Item Type | Article |
| Date Type | Publication |
| Status | Published |
| Schools | Schools > Engineering |
| Publisher | Elsevier |
| ISSN | 0926-5805 |
| Date of First Compliant Deposit | 9 March 2025 |
| Date of Acceptance | 3 March 2025 |
| Last Modified | 28 April 2025 14:15 |
| URI | https://orca.cardiff.ac.uk/id/eprint/176738 |