Advanced Textile Technology ›› 2025, Vol. 33 ›› Issue (05): 96-106.


Virtual try-on network based on an interactive multi-head attention mechanism

  1. School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China; 2. Zhejiang Provincial Innovation Center of Advanced Textile Technology, Shaoxing 310020, China
  • Published: 2025-05-10  Online: 2025-05-20



Abstract: With the booming development of e-commerce and the popularity of online clothing shopping, virtual try-on technology has advanced significantly. Current virtual try-on methods fall into two categories, 3D and 2D image based, of which 2D image virtual try-on is widely used because it is easy to operate and low in cost. 2D methods are further subdivided into those based on Generative Adversarial Networks (GANs) and those based on diffusion networks. In recent years, diffusion-based virtual try-on has received widespread attention because it outperforms GAN-based methods in realism, stability, and detail processing. StableVITON is an important benchmark model in this field and, relying on the powerful generative ability of diffusion networks, has achieved significant results in synthesizing try-on images. However, it still falls short in capturing and preserving clothing features and details: for example, it cannot reliably distinguish long from short sleeves, or reproduce colors and details such as cuffs and necklines.
To address the loss of clothing features and details in StableVITON, this paper proposed a virtual try-on network based on an interactive multi-head attention mechanism. Specifically, an interactive multi-head attention mechanism was introduced into the clothing encoding block of StableVITON to facilitate interaction between different heads and learn rich feature correlations, thereby enhancing the network's attention performance and retaining more clothing features and details. Three strategies were adopted to achieve this goal. Firstly, the latent space of the diffusion network was pre-trained to learn semantic correspondences between the clothing and the human body. Secondly, a zero cross-attention mechanism was introduced into the U-Net decoder. Lastly, the multi-head attention was replaced with an interactive version that learns rich feature correlations through dense cross-head interaction, strengthening the combination of local and global information, reducing information loss, and improving the learning efficiency and stability of the model.
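The "interactive" adjustment in the last of these strategies can be pictured as letting the per-head attention maps influence one another before they weight the values. The sketch below is a minimal NumPy illustration of that idea, not the authors' implementation: the function name, the head-mixing matrix `M`, and all shapes are assumptions for exposition, and head interaction is realized here as a learned linear mix of attention logits across heads.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def interactive_mha(X, Wq, Wk, Wv, Wo, M):
    """Multi-head self-attention with cross-head interaction (illustrative).

    X: (T, d) token features; Wq/Wk/Wv: (h, d, dk) per-head projections;
    Wo: (h*dk, d) output projection; M: (h, h) learned head-mixing matrix.
    With M = identity this reduces to ordinary multi-head attention;
    any other M lets each head see the attention logits of the others.
    """
    h, d, dk = Wq.shape
    Q = np.einsum('td,hdk->htk', X, Wq)
    K = np.einsum('td,hdk->htk', X, Wk)
    V = np.einsum('td,hdk->htk', X, Wv)
    logits = np.einsum('htk,hsk->hts', Q, K) / np.sqrt(dk)   # (h, T, T)
    logits = np.einsum('gh,hts->gts', M, logits)             # cross-head mix
    A = softmax(logits, axis=-1)                             # attention maps
    heads = np.einsum('hts,hsk->htk', A, V)                  # (h, T, dk)
    concat = heads.transpose(1, 0, 2).reshape(X.shape[0], h * dk)
    return concat @ Wo                                       # (T, d)
```

With `M` initialized near the identity, such a layer starts out as plain multi-head attention and learns how strongly the heads should interact.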
To verify the effectiveness of the proposed method, qualitative and quantitative experiments were conducted on the VITON-HD dataset. The results show that the proposed network generates more realistic overall clothing features and local details than other mainstream models. Compared with StableVITON, it improves the average Structural Similarity Index (SSIM) by 1.53%, reduces the average Learned Perceptual Image Patch Similarity (LPIPS) by 0.71%, lowers the average Fréchet Inception Distance (FID) by 0.15%, and decreases the average Kernel Inception Distance (KID) by 1.14%. The network effectively preserves clothing detail and significantly enhances image fidelity; its synthesized try-on images can give consumers a better shopping experience and can be widely applied in digital fashion scenarios such as virtual try-on.
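For reference, the SSIM metric quoted above compares luminance, contrast, and structure between a generated image and its ground truth. The snippet below computes the standard global (single-window) form of the index with the usual constants from the SSIM definition; it only illustrates what the metric measures, since published evaluations such as this one normally average a windowed SSIM over the image.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """Global (single-window) SSIM between two images with values in [0, data_range].

    Uses the standard stabilizing constants C1 = (0.01*L)^2 and C2 = (0.03*L)^2.
    Identical images score 1.0; dissimilar images score lower.
    """
    C1 = (0.01 * data_range) ** 2
    C2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + C1) * (2 * cov + C2)) / \
           ((mx ** 2 + my ** 2 + C1) * (vx + vy + C2))
```

LPIPS, FID, and KID, by contrast, compare images in the feature space of a pretrained network rather than pixel statistics, which is why they are reported alongside SSIM.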

Key words: interactive, multi-head attention, StableVITON, virtual try-on, Stable Diffusion
