research-article
Authors: Zhenghao Chen, Lucas Relic, Roberto Azevedo, Yang Zhang, Markus Gross, Dong Xu, Luping Zhou, Christopher Schroers
MM '23: Proceedings of the 31st ACM International Conference on Multimedia
Pages 8543 - 8551
Published: 27 October 2023
Abstract
Although existing neural video compression (NVC) methods have achieved significant success, most of them focus on improving either temporal or spatial information separately. They generally rely on simple operations such as concatenation or subtraction to exploit this information, which only partially captures spatio-temporal redundancies. This work aims to jointly and effectively leverage robust temporal and spatial information by proposing a new 3D-based transformer module: the Spatio-Temporal Cross-Covariance Transformer (ST-XCT). The ST-XCT module combines two individually extracted features into a joint spatio-temporal feature, followed by 3D convolutional operations and a novel spatio-temporal-aware cross-covariance attention mechanism. Unlike conventional transformers, the cross-covariance attention mechanism is applied across the feature channels without breaking the spatio-temporal features into local tokens. This design allows modeling global cross-channel correlations of the spatio-temporal context while lowering the computational requirements. Based on ST-XCT, we introduce a novel transformer-based end-to-end optimized NVC framework. ST-XCT-based modules are integrated into various key coding components of NVC, such as feature extraction, frame reconstruction, and entropy modeling, demonstrating its generalizability. Extensive experiments show that our ST-XCT-based NVC framework achieves state-of-the-art compression performance on standard video benchmark datasets.
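To make the channel-wise attention idea concrete, below is a minimal sketch (not the authors' released code) of cross-covariance attention applied to a fused spatio-temporal feature of shape (B, C, T, H, W), assuming an XCiT-style formulation. The module name SpatioTemporalXCA, the 3D-convolutional q/k/v projection, and the residual connection are illustrative assumptions rather than the paper's exact design.

```python
# A minimal sketch (not the authors' released code) of spatio-temporal
# cross-covariance attention on a fused feature of shape (B, C, T, H, W).
# Module/variable names and the conv-based q/k/v projection are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalXCA(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        assert channels % num_heads == 0
        self.num_heads = num_heads
        # Learnable per-head temperature, as in XCiT-style channel attention.
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))
        # A 3x3x3 convolution produces q/k/v while mixing local
        # spatio-temporal context (assumed fusion step).
        self.qkv = nn.Conv3d(channels, channels * 3, kernel_size=3, padding=1)
        self.proj = nn.Conv3d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, h, w = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)

        # Flatten all T*H*W positions; attention is computed over channels,
        # so each head's attention map is (C/heads) x (C/heads) instead of
        # the N x N map a token-based transformer would build.
        def heads(y: torch.Tensor) -> torch.Tensor:
            return y.reshape(b, self.num_heads, c // self.num_heads, t * h * w)

        q, k, v = heads(q), heads(k), heads(v)
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature  # cross-covariance
        attn = attn.softmax(dim=-1)
        out = (attn @ v).reshape(b, c, t, h, w)
        return x + self.proj(out)  # residual connection (assumed)
```

Because each head's attention map is (C/heads) x (C/heads), the cost grows linearly with the number of spatio-temporal positions T*H*W, which is the efficiency argument made above; for example, SpatioTemporalXCA(64)(torch.randn(1, 64, 2, 32, 32)) returns a tensor of the same shape as its input.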
Index Terms
Neural Video Compression with Spatio-Temporal Cross-Covariance Transformers
Computing methodologies
Artificial intelligence
Computer vision
Computer graphics
Image compression
Published In
MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN:9798400701085
DOI:10.1145/3581783
- General Chairs:
  - Abdulmotaleb El Saddik (University of Ottawa, Canada & MBZUAI, UAE)
  - Tao Mei (HiDream.ai, China)
  - Rita Cucchiara (University of Modena and Reggio Emilia, Italy)
- Program Chairs:
  - Marco Bertini (University of Florence, Italy)
  - Diana Patricia Tobon Vallejo (Universidad de Medellin, Colombia)
  - Pradeep K. Atrey (University at Albany, State University of New York, USA)
  - M. Shamim Hossain (King Saud University, KSA)
Copyright © 2023 ACM.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Sponsors
- SIGMM: ACM Special Interest Group on Multimedia
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
Published: 27 October 2023
Author Tags
- neural network
- transformer
- video compression
Qualifiers
- Research-article
Conference
MM '23
Sponsor:
- SIGMM
MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada
Acceptance Rates
Overall Acceptance Rate 995 of 4,171 submissions, 24%