NVIDIA releases TensorRT-LLM: H100 GPU inference performance soars 8x, setting another AI compute record

Today, NVIDIA officially announced a major breakthrough: TensorRT-LLM, an open-source library of deep optimizations designed to significantly improve AI inference performance for large language models on Hopper and other NVIDIA GPUs.

NVIDIA is actively working with the open-source community to optimize AI kernels for its GPUs using advanced techniques including SmoothQuant, FlashAttention, and fMHA, enabling acceleration of models such as GPT-3 (175B), Llama, Falcon (180B), and Bloom.

A key feature of TensorRT-LLM is a scheduling scheme called "in-flight batching," which allows the GPU to dynamically process multiple smaller queries while still handling large, compute-intensive requests. This scheme dramatically improves GPU utilization, with H100 GPUs delivering up to 2x higher throughput than before.
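To illustrate the idea behind in-flight (continuous) batching, here is a minimal, hypothetical Python simulation. It is not TensorRT-LLM's actual API; it only models the scheduling principle: finished sequences leave the batch immediately and queued requests take their slots, rather than waiting for the entire batch to drain.

```python
from collections import deque

def inflight_batching(requests, max_batch_size):
    """Simulate in-flight batching for LLM decoding.

    requests: list of (request_id, tokens_to_generate) tuples.
    Returns (completion_order, total_decode_steps).
    """
    queue = deque(requests)   # waiting requests
    active = []               # requests currently in the batch
    completions = []          # order in which requests finish
    steps = 0
    while queue or active:
        # Refill freed batch slots from the queue before each decode step.
        while queue and len(active) < max_batch_size:
            req_id, remaining = queue.popleft()
            active.append([req_id, remaining])
        steps += 1
        # One decode step: every active request generates one token.
        for req in active:
            req[1] -= 1
        # Finished requests leave immediately, freeing their slots.
        for req in [r for r in active if r[1] == 0]:
            completions.append(req[0])
            active.remove(req)
    return completions, steps
```

With requests of lengths 2, 5, 1, and 3 and a batch size of 2, this scheduler finishes in 6 decode steps, whereas static batching (waiting for each full batch to complete) would need 8 (max(2, 5) + max(1, 3)), because short sequences no longer idle behind long ones.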

In performance tests, NVIDIA used the A100 as a baseline and compared the H100 with and without TensorRT-LLM. In inference on the GPT-J 6B model, the H100 outperforms the A100 by a factor of 4, and the TensorRT-LLM-enabled H100 achieves 8x the performance of the A100.

On the Llama 2 model, the H100's inference performance improves by 2.6x over the A100, while the TensorRT-LLM-enabled H100 delivers up to 4.6x the performance of the A100.

This breakthrough once again highlights NVIDIA's technological prowess in AI computing and will provide powerful compute support for even more capable AI applications. The full report has been published; interested readers can explore it in detail.

Report address: developer.nvidia.com/blog/nvidia-tensorrt-llm-supercharges-large-language-model-inference-on-nvidia-h100-gpus/
