NVIDIA releases TensorRT-LLM: H100 GPU inference performance soars 8x, setting another AI compute record

Today, NVIDIA officially announced a major breakthrough: TensorRT-LLM, an open-source library of deep optimizations designed to significantly improve AI inference performance for large language models on Hopper and other NVIDIA GPUs.

NVIDIA is actively working with the open-source community to optimize AI kernels for its GPUs using advanced techniques including SmoothQuant, FlashAttention, and fMHA, enabling acceleration of models such as GPT-3 (175B), Llama, Falcon (180B), and Bloom.

A key feature of TensorRT-LLM is a scheduling scheme called "in-flight batching," which allows the GPU to dynamically process multiple smaller queries while still handling large, compute-intensive requests. This scheme dramatically improves GPU utilization, with H100 GPUs delivering up to 2x higher throughput than before.
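To illustrate the idea behind in-flight (continuous) batching, here is a minimal, hypothetical Python simulation. It is not TensorRT-LLM's actual API; it only models the scheduling principle: finished sequences leave the batch immediately and queued requests take their slots, rather than waiting for the entire batch to drain.

```python
from collections import deque

def inflight_batching(requests, max_batch_size):
    """Simulate in-flight batching for LLM decoding.

    requests: list of (request_id, tokens_to_generate) tuples.
    Returns (completion_order, total_decode_steps).
    """
    queue = deque(requests)   # waiting requests
    active = []               # requests currently in the batch
    completions = []          # order in which requests finish
    steps = 0
    while queue or active:
        # Refill freed batch slots from the queue before each decode step.
        while queue and len(active) < max_batch_size:
            req_id, remaining = queue.popleft()
            active.append([req_id, remaining])
        steps += 1
        # One decode step: every active request generates one token.
        for req in active:
            req[1] -= 1
        # Finished requests leave immediately, freeing their slots.
        for req in [r for r in active if r[1] == 0]:
            completions.append(req[0])
            active.remove(req)
    return completions, steps
```

With requests of lengths 2, 5, 1, and 3 and a batch size of 2, this scheduler finishes in 6 decode steps, whereas static batching (waiting for each full batch to complete) would need 8 (max(2, 5) + max(1, 3)), because short sequences no longer idle behind long ones.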

In performance tests, NVIDIA used the A100 as a baseline and compared the H100 with and without TensorRT-LLM. In inference on the GPT-J 6B model, the H100 outperforms the A100 by a factor of 4, and the TensorRT-LLM-enabled H100 achieves 8x the performance of the A100.

On the Llama 2 model, the H100's inference performance improves by 2.6x over the A100, while the TensorRT-LLM-enabled H100 delivers up to 4.6x the performance of the A100.

This breakthrough once again highlights NVIDIA's technological prowess in AI computing and will provide powerful compute support for even more capable AI applications. The full report has been published; interested readers can explore it in detail.

Report address: developer.nvidia.com/blog/nvidia-tensorrt-llm-supercharges-large-language-model-inference-on-nvidia-h100-gpus/
