Advanced Textile Technology ›› 2025, Vol. 33 ›› Issue (05): 96-106.


Virtual try-on network based on an interactive multi-head attention mechanism

  1. School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China; 2. Zhejiang Provincial Innovation Center of Advanced Textile Technology, Shaoxing 310020, China
  • Published: 2025-05-10  Online: 2025-05-20



Abstract: With the booming development of e-commerce and the popularity of online clothing shopping, virtual try-on technology has advanced significantly. Current virtual try-on methods fall into two categories, 3D and 2D image based, of which 2D image virtual try-on is widely used because it is easy to operate and low in cost. 2D methods are further subdivided into those based on Generative Adversarial Networks (GANs) and those based on diffusion networks. In recent years, diffusion-based virtual try-on has received widespread attention because it outperforms GAN-based methods in realism, stability, and detail processing. StableVITON is an important benchmark model in this field and, relying on the powerful generative ability of diffusion networks, has achieved significant results in synthesizing try-on images. However, it still falls short in capturing and preserving clothing features and details: for example, it cannot reliably distinguish long from short sleeves, or reproduce colors and details such as cuffs and necklines.
To address the loss of clothing features and details in StableVITON, this paper proposed a virtual try-on network based on an interactive multi-head attention mechanism. Specifically, an interactive multi-head attention mechanism was introduced into the clothing encoding block of StableVITON to facilitate interaction between different heads and learn rich feature correlations, thereby enhancing the network's attention performance and retaining more clothing features and details. Three strategies were adopted to achieve this goal. Firstly, the latent space of the diffusion network was pre-trained to learn semantic correspondences between the clothing and the human body. Secondly, a zero cross-attention mechanism was introduced into the U-Net decoder. Lastly, the multi-head attention was replaced with an interactive version that learns rich feature correlations through dense cross-head interaction, strengthening the combination of local and global information, reducing information loss, and improving the learning efficiency and stability of the model.
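The "interactive" adjustment in the last of these strategies can be pictured as letting the per-head attention maps influence one another before they weight the values. The sketch below is a minimal NumPy illustration of that idea, not the authors' implementation: the function name, the head-mixing matrix `M`, and all shapes are assumptions for exposition, and head interaction is realized here as a learned linear mix of attention logits across heads.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def interactive_mha(X, Wq, Wk, Wv, Wo, M):
    """Multi-head self-attention with cross-head interaction (illustrative).

    X: (T, d) token features; Wq/Wk/Wv: (h, d, dk) per-head projections;
    Wo: (h*dk, d) output projection; M: (h, h) learned head-mixing matrix.
    With M = identity this reduces to ordinary multi-head attention;
    any other M lets each head see the attention logits of the others.
    """
    h, d, dk = Wq.shape
    Q = np.einsum('td,hdk->htk', X, Wq)
    K = np.einsum('td,hdk->htk', X, Wk)
    V = np.einsum('td,hdk->htk', X, Wv)
    logits = np.einsum('htk,hsk->hts', Q, K) / np.sqrt(dk)   # (h, T, T)
    logits = np.einsum('gh,hts->gts', M, logits)             # cross-head mix
    A = softmax(logits, axis=-1)                             # attention maps
    heads = np.einsum('hts,hsk->htk', A, V)                  # (h, T, dk)
    concat = heads.transpose(1, 0, 2).reshape(X.shape[0], h * dk)
    return concat @ Wo                                       # (T, d)
```

With `M` initialized near the identity, such a layer starts out as plain multi-head attention and learns how strongly the heads should interact.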
To verify the effectiveness of the proposed method, qualitative and quantitative experiments were conducted on the VITON-HD dataset. The results show that the proposed network generates more realistic overall clothing features and local details than other mainstream models. Compared with StableVITON, it improves the average Structural Similarity Index (SSIM) by 1.53%, reduces the average Learned Perceptual Image Patch Similarity (LPIPS) by 0.71%, lowers the average Fréchet Inception Distance (FID) by 0.15%, and decreases the average Kernel Inception Distance (KID) by 1.14%. The network effectively preserves clothing detail and significantly enhances image fidelity; its synthesized try-on images can give consumers a better shopping experience and can be widely applied in digital fashion scenarios such as virtual try-on.
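For reference, the SSIM metric quoted above compares luminance, contrast, and structure between a generated image and its ground truth. The snippet below computes the standard global (single-window) form of the index with the usual constants from the SSIM definition; it only illustrates what the metric measures, since published evaluations such as this one normally average a windowed SSIM over the image.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """Global (single-window) SSIM between two images with values in [0, data_range].

    Uses the standard stabilizing constants C1 = (0.01*L)^2 and C2 = (0.03*L)^2.
    Identical images score 1.0; dissimilar images score lower.
    """
    C1 = (0.01 * data_range) ** 2
    C2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + C1) * (2 * cov + C2)) / \
           ((mx ** 2 + my ** 2 + C1) * (vx + vy + C2))
```

LPIPS, FID, and KID, by contrast, compare images in the feature space of a pretrained network rather than pixel statistics, which is why they are reported alongside SSIM.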

Key words: interactive, multi-head attention, StableVITON, virtual try-on, Stable Diffusion
