Neural Video Compression with Spatio-Temporal Cross-Covariance Transformers | Proceedings of the 31st ACM International Conference on Multimedia (2024)

research-article

Authors: Zhenghao Chen, Lucas Relic, Roberto Azevedo, Yang Zhang, Markus Gross, Dong Xu, Luping Zhou, Christopher Schroers

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

Pages 8543 - 8551

Published: 27 October 2023

Metrics

Total Citations: 4

Total Downloads: 281

Downloads (Last 12 Months): 281

Downloads (Last 6 weeks): 14


Abstract

Although existing neural video compression (NVC) methods have achieved significant success, most of them focus on improving either temporal or spatial information separately. They generally rely on simple operations such as concatenation or subtraction to use this information, and such operations only partially exploit spatio-temporal redundancies. This work aims to leverage robust temporal and spatial information jointly and effectively by proposing a new 3D-based transformer module: the Spatio-Temporal Cross-Covariance Transformer (ST-XCT). The ST-XCT module combines two individually extracted features into a joint spatio-temporal feature, followed by 3D convolutional operations and a novel spatio-temporal-aware cross-covariance attention mechanism. Unlike conventional transformers, the cross-covariance attention mechanism is applied across the feature channels without breaking down the spatio-temporal features into local tokens. This design allows for modeling global cross-channel correlations of the spatio-temporal context while lowering the computational requirement. Based on ST-XCT, we introduce a novel transformer-based end-to-end optimized NVC framework. ST-XCT-based modules are integrated into various key coding components of NVC, such as feature extraction, frame reconstruction, and entropy modeling, demonstrating their generalizability. Extensive experiments show that our ST-XCT-based NVC proposal achieves state-of-the-art compression performance on various standard video benchmark datasets.
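To make the channel-wise attention idea concrete, the following is a minimal single-head sketch of generic cross-covariance attention applied to a joint spatio-temporal feature. It is not the authors' ST-XCT module: the 3D convolutions, the feature fusion step, and any multi-head or learned-temperature details are omitted, and the function name, the (T, H, W, C) layout, the projection matrices `w_q`, `w_k`, `w_v`, and the temperature `tau` are all assumptions made for illustration.

```python
import numpy as np


def cross_covariance_attention(x, w_q, w_k, w_v, tau=1.0):
    """Hypothetical single-head cross-covariance attention.

    x            : (T, H, W, C) joint spatio-temporal feature (assumed layout)
    w_q, w_k, w_v: (C, C) projection matrices (assumed, e.g. learned)
    Returns the attended feature (T, H, W, C) and the (C, C) attention map.
    """
    t, h, w, c = x.shape
    flat = x.reshape(t * h * w, c)            # all N = T*H*W positions, C channels

    q = flat @ w_q                            # (N, C)
    k = flat @ w_k                            # (N, C)
    v = flat @ w_v                            # (N, C)

    # L2-normalise every channel over all spatio-temporal positions.
    q = q / (np.linalg.norm(q, axis=0, keepdims=True) + 1e-6)
    k = k / (np.linalg.norm(k, axis=0, keepdims=True) + 1e-6)

    # Channel-by-channel similarity: a C x C map instead of an N x N one.
    logits = (k.T @ q) / tau                  # (C, C)
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=0, keepdims=True)        # softmax over input channels

    out = flat @ np.zeros((c, c)) + v @ attn  # (N, C); each output channel mixes value channels
    return out.reshape(t, h, w, c), attn


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feat = rng.standard_normal((2, 16, 16, 8))            # toy feature, T=2, H=W=16, C=8
    w_q, w_k, w_v = (rng.standard_normal((8, 8)) for _ in range(3))
    out, attn = cross_covariance_attention(feat, w_q, w_k, w_v)
    print(out.shape, attn.shape)
```

The attention map has C × C entries regardless of resolution, which is where the reduced cost comes from. As a worked illustration with assumed sizes T = 2, H = W = 64, C = 64, there are N = 8192 positions, so a position-wise attention map would hold N² = 67,108,864 entries, whereas the channel-wise map holds C² = 4,096.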

Index Terms

  1. Neural Video Compression with Spatio-Temporal Cross-Covariance Transformers

    1. Computing methodologies

      1. Artificial intelligence

        1. Computer vision

        2. Computer graphics

          1. Image compression

      Published In

      MM '23: Proceedings of the 31st ACM International Conference on Multimedia

      October 2023

      9913 pages

      ISBN:9798400701085

      DOI:10.1145/3581783

      • General Chairs:
        • Abdulmotaleb El Saddik, University of Ottawa, Canada & MBZUAI, UAE
        • Tao Mei, HiDream.ai, China
        • Rita Cucchiara, University of Modena and Reggio Emilia, Italy
      • Program Chairs:
        • Marco Bertini, University of Florence, Italy
        • Diana Patricia Tobon Vallejo, Universidad de Medellin, Colombia
        • Pradeep K. Atrey, University at Albany, State University of New York, USA
        • M. Shamim Hossain, King Saud University, KSA

      Copyright © 2023 ACM.

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [emailprotected].

      Sponsors

      • SIGMM: ACM Special Interest Group on Multimedia

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. neural network
      2. transformer
      3. video compression

      Qualifiers

      • Research-article

      Conference

      MM '23

      Sponsor:

      • SIGMM

      MM '23: The 31st ACM International Conference on Multimedia

      October 29 - November 3, 2023

      Ottawa ON, Canada

      Acceptance Rates

      Overall Acceptance Rate 995 of 4,171 submissions, 24%
