Supercharging Llama 3.1 across NVIDIA Platforms

Meta’s Llama collection of large language models is the most popular family of foundation models in the open-source community today, supporting a variety of use cases. Millions of developers worldwide are building derivative models and integrating them into their applications.

With Llama 3.1, Meta is launching a suite of large language models (LLMs) as well as a suite of trust and safety models to ensure safe LLM responses.

Meta engineers trained Llama 3 on NVIDIA H100 Tensor Core GPUs. They significantly optimized their full training stack and pushed model training to over 16K H100 GPUs, making the 405B the first Llama model trained at this scale.

We are excited to announce that the Llama 3.1 collection is optimized for the 100M+ NVIDIA GPUs worldwide, across all NVIDIA platforms, from data centers to the edge and PCs.

Accelerating Llama 3.1 on the NVIDIA-accelerated computing platform

The latest NVIDIA H200 Tensor Core GPUs, running TensorRT-LLM, deliver outstanding inference performance on Llama 3.1-405B. With the large HBM3e memory capacity of the H200 GPU, the model fits comfortably in a single HGX H200 with eight H200 GPUs. Fourth-generation NVLink and third-generation NVSwitch accelerate inference throughput when running large models, like Llama 3.1-405B, by providing high-bandwidth communication 7x faster than PCIe Gen 5 between all GPUs in the server.

Table 1 shows the maximum throughput performance of Llama 3.1-405B running on an 8-GPU H200 system, across a variety of input and output sequence lengths.

Input Sequence Length | Output Sequence Length | Output Tokens/Second
2,048 | 128 | 399.9
32,768 | 2,048 | 230.8
120,000 | 2,048 | 49.6

Table 1. Maximum throughput performance. NVIDIA internal measurements. Output tokens/second is inclusive of time to generate the first token. tok/s = total generated tokens / total latency. DGX H200, TP8, FP8, batch size tuned for maximum node throughput, TensorRT-LLM version 0.12.0.dev2024072300.

In addition to maximum throughput performance, Table 2 shows minimum latency performance using the same input and output sequence lengths:

Input Sequence Length | Output Sequence Length | Output Tokens/Second
2,048 | 128 | 37.4
32,768 | 2,048 | 33.1
120,000 | 2,048 | 22.8

Table 2. Minimum latency performance. NVIDIA internal measurements. Output tokens/second is inclusive of time to generate the first token. tok/s = total generated tokens / total latency. DGX H200, TP8, FP8, batch size = 1, TensorRT-LLM version 0.12.0.dev2024072300.

As these results show, H200 GPUs and TensorRT-LLM are already delivering great performance on Llama 3.1-405B at launch, in both latency-optimized and throughput-optimized scenarios.

Build with Llama 3.1 every step of the way using NVIDIA software

To adopt Llama within applications, you need the following capabilities:

  • Capability to tailor a model to a specific domain
  • Support for embedding models to enable retrieval-augmented generation (RAG) applications
  • Ability to evaluate model accuracy
  • Capability to keep a conversation on-topic and safe
  • Optimized inferencing solutions

With this release, NVIDIA enables you to perform all of these tasks with NVIDIA software, making adoption easier.

First, high-quality datasets are imperative for training, customizing, and evaluating language models. However, some developers find it challenging to gain access to quality datasets with suitable licensing terms.

NVIDIA addresses this issue by offering a synthetic data generation (SDG) pipeline, which builds on Llama 3.1, to help you create custom high-quality datasets.

[Figure: Synthetic data generation pipeline built on Llama 3.1-405B, with the Nemotron-4 340B Reward model filtering for quality]

With Llama 3.1-405B, you get access to a state-of-the-art generative model that can be used as the generator in the SDG pipeline. The data-generation phase is followed by the Nemotron-4 340B Reward model, which evaluates the quality of the data, filters out lower-scored samples, and provides datasets that align with human preferences. The reward model tops the RewardBench leaderboard with an overall score of 92.0 and excels in the Chat-Hard subset, which tests the model’s ability to handle trick questions and nuances in instruction responses. For more information, see Creating Synthetic Data Using Llama 3.1 405B.
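
As a rough illustration, the following sketch wires the two stages together through OpenAI-compatible endpoints, such as the NVIDIA API catalog or self-hosted NIM microservices. The endpoint URL, model identifiers, the reward-score parsing, and the 3.5 cutoff are illustrative assumptions rather than details from this post.

```python
# Minimal synthetic data generation (SDG) sketch: generate candidate responses
# with Llama 3.1-405B, then keep only the pairs the reward model scores highly.
# Endpoint URL, model IDs, score parsing, and the 3.5 cutoff are assumptions.
from openai import OpenAI

client = OpenAI(base_url="https://integrate.api.nvidia.com/v1", api_key="YOUR_API_KEY")

def generate(prompt: str) -> str:
    """Use Llama 3.1-405B as the generator in the SDG pipeline."""
    out = client.chat.completions.create(
        model="meta/llama-3.1-405b-instruct",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        max_tokens=512,
    )
    return out.choices[0].message.content

def helpfulness(prompt: str, response: str) -> float:
    """Score a prompt/response pair with the Nemotron-4 340B Reward model.
    Assumption: the service returns attribute scores such as 'helpfulness:4.1'
    in the assistant message; adjust the parsing to the actual response schema."""
    out = client.chat.completions.create(
        model="nvidia/nemotron-4-340b-reward",
        messages=[
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response},
        ],
    )
    scores = dict(item.split(":") for item in out.choices[0].message.content.split(","))
    return float(scores["helpfulness"])

seed_prompts = ["Explain KV caching in one paragraph.", "Summarize what RAG is."]
dataset = []
for prompt in seed_prompts:
    response = generate(prompt)
    if helpfulness(prompt, response) >= 3.5:   # illustrative quality cutoff
        dataset.append({"prompt": prompt, "response": response})
```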

When the dataset is ready, it can be further curated, customized, and evaluated with the NVIDIA NeMo platform.

NVIDIA NeMo

To build custom models and applications with Llama 3.1, you can use NVIDIA NeMo. NeMo offers an end-to-end platform for developing custom generative AI, anywhere. It uses advanced parallelism techniques to maximize NVIDIA GPU performance, managing GPU resources and memory across multiple nodes and GPUs.

Use this open-source platform for any or all of the following tasks:

  • Curate data with NeMo Curator to compile high-quality data and improve the custom model’s performance by cleaning, deduplicating, filtering, and classifying datasets.
  • Customize models with parameter-efficient fine-tuning (PEFT) techniques such as p-tuning, low-rank adaptation (LoRA), and its quantized version (QLoRA). These techniques are useful for creating custom models without requiring large amounts of compute.
  • Steer model responses and align Llama 3.1 models to human preferences, making the models ready to integrate into customer-facing applications. Current support in NeMo includes the following:
    • Supervised fine-tuning (SFT)
    • Reinforcement learning from human feedback (RLHF)
    • Direct preference optimization (DPO)
    • NeMo SteerLM
  • Streamline LLM evaluation with the NeMo Evaluator microservice, now in early access. This microservice can automatically evaluate models against academic benchmarks and custom datasets, as well as with LLM-as-a-judge (useful in scenarios where ground truth is undefined).
  • Incorporate retrieval-augmented generation (RAG) capabilities with NeMo Retriever, a collection of microservices that provide state-of-the-art, open, and commercial data retrieval with high accuracy and maximum data privacy.
  • Alleviate hallucinations with NeMo Guardrails, which enables you to add programmable guardrails to LLM-based conversational applications, ensuring trustworthiness, safety, security, and controlled dialog. It can be extended with other guardrails and safety models, such as Meta’s latest Llama Guard, and it integrates seamlessly with developer tools, including popular frameworks such as LangChain and LlamaIndex (a minimal guardrails sketch follows this list).
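
Because NeMo Guardrails is open source, a minimal integration can be sketched directly. The guardrails_config directory and its contents (model assignment and rail definitions) are assumptions for illustration; see the NeMo Guardrails documentation for the full configuration options.

```python
# Minimal NeMo Guardrails sketch: wrap an LLM endpoint with programmable rails.
# The ./guardrails_config directory (config.yml plus Colang rail definitions)
# is assumed to exist and to point at a Llama 3.1 endpoint; its contents are
# illustrative, not taken from this post.
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./guardrails_config")
rails = LLMRails(config)

# The rails intercept the conversation, apply topical and safety checks,
# and only then pass the request on to the underlying model.
reply = rails.generate(messages=[
    {"role": "user", "content": "Can you help me with my billing question?"}
])
print(reply["content"])
```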

Use these tools and more through NVIDIA AI Foundry.

Taking Llama everywhere

Meta-Llama 3.1-8B models are now optimized for inference on NVIDIA GeForce RTX PCs and NVIDIA RTX workstations.

With TensorRT Model Optimizer for Windows, Llama 3.1-8B models are quantized to INT4 with the AWQ post-training quantization (PTQ) method. This lower precision allows the models to fit within the GPU memory available on NVIDIA RTX GPUs and improves performance by reducing memory bandwidth bottlenecks. These models are natively supported with NVIDIA TensorRT-LLM, our open-source software that accelerates LLM inference performance.
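
As a rough sketch of what such a post-training quantization step can look like with the TensorRT Model Optimizer Python API (this uses the general modelopt package rather than the Windows-specific workflow, and the model ID, calibration prompts, and recipe details are illustrative assumptions):

```python
# Sketch of INT4 AWQ post-training quantization with TensorRT Model Optimizer.
# Model ID, calibration prompts, and the exact recipe are illustrative assumptions.
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda")

calib_prompts = ["The quick brown fox jumps over the lazy dog.",
                 "Large language models are used for"]

def forward_loop(m):
    # AWQ calibration: run representative prompts so per-channel scales can be chosen.
    for prompt in calib_prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        m(**inputs)

# Quantize weights to INT4 with the AWQ recipe (weight-only PTQ).
model = mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop)
```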

Llama 3.1-8B models are also optimized on NVIDIA Jetson Orin for robotics and edge computing devices.

Maximum performance with Llama 3.1

All Llama 3.1 models support 128K context length and are available as base and instruct variants in BF16 precision.

These models are also now accelerated with TensorRT-LLM. TensorRT-LLM compiles the models into TensorRT engines, mapping model layers onto optimized CUDA kernels through pattern matching and fusion to maximize inference performance. These engines are then executed by the TensorRT-LLM runtime, which includes several optimizations (a minimal usage sketch follows this list):

  • in-flight batching
  • KV caching
  • quantization to support lower-precision workloads
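
The optimizations above are applied automatically when an engine is built and served. The following minimal sketch uses the high-level LLM API in TensorRT-LLM; the model ID and sampling settings are illustrative, and the availability and exact shape of this API depend on the TensorRT-LLM version you install.

```python
# Minimal TensorRT-LLM sketch using the high-level LLM API: engine build
# (layer fusion, kernel selection) and runtime features such as in-flight
# batching and KV caching are handled internally. Model ID and sampling
# parameters are illustrative assumptions.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")  # builds or loads an engine

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["What is in-flight batching?"], params)

for out in outputs:
    print(out.outputs[0].text)
```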

TensorRT-LLM supports the 128K context length using a scaled rotary position embedding (RoPE) technique, including multi-GPU, multi-node inference of the Llama 3.1-405B model in BF16 precision on H100 and single-node inference on H200.

Inference in FP8 precision is also supported. Using post-training quantization (PTQ) on NVIDIA Hopper and NVIDIA Ada GPUs, you can optimize and reduce model complexity by creating smaller models with a lower memory footprint, without sacrificing accuracy.

For the Llama 3.1-405B model, TensorRT-LLM has added support for FP8 quantization at a row-wise granularity level. This involves calculating a static scaling factor for each output weight channel (before execution) and a dynamic scaling factor for each token (during execution) to preserve maximum accuracy.
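
The following short sketch illustrates the idea in plain PyTorch rather than TensorRT-LLM's actual kernels: static per-output-channel scales for the weights computed ahead of time, and dynamic per-token scales for the activations computed at run time, both referenced to the FP8 E4M3 maximum of 448.

```python
# Illustration of row-wise FP8 scaling (not TensorRT-LLM's real implementation):
# static per-output-channel weight scales computed before execution, and dynamic
# per-token activation scales computed during execution. E4M3 max value is 448.
import torch

E4M3_MAX = 448.0

def weight_scales(w: torch.Tensor) -> torch.Tensor:
    # w: [out_channels, in_channels]; one static scale per output channel.
    return w.abs().amax(dim=1).clamp(min=1e-12) / E4M3_MAX

def token_scales(x: torch.Tensor) -> torch.Tensor:
    # x: [num_tokens, in_channels]; one dynamic scale per token.
    return x.abs().amax(dim=1).clamp(min=1e-12) / E4M3_MAX

w = torch.randn(4096, 4096)   # toy weight matrix
x = torch.randn(8, 4096)      # toy activations for 8 tokens

sw, sx = weight_scales(w), token_scales(x)
w_fp8 = (w / sw[:, None]).to(torch.float8_e4m3fn)   # quantized offline
x_fp8 = (x / sx[:, None]).to(torch.float8_e4m3fn)   # quantized per step

# Dequantized matmul approximates the original result: y ≈ x @ w.T
y = (x_fp8.float() * sx[:, None]) @ (w_fp8.float() * sw[:, None]).T
```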

During the TensorRT engine build process, some complex layer fusions cannot be discovered automatically. TensorRT-LLM optimizes these using plugins: user-defined kernels that are explicitly inserted into the network graph definition at compile time, such as the FBGEMM matrix multiplications used for the Llama 3.1 models.

For ease of use and deployment, the TensorRT Model Optimizer and TensorRT-LLM optimizations are bundled together into NVIDIA NIM inference microservices.

NVIDIA NIM

Llama 3.1 is now supported through NVIDIA NIM for production deployments. NIM inference microservices accelerate the deployment of generative AI models across NVIDIA-accelerated infrastructure anywhere, including cloud, data center, and workstations.

NIM offers support for dynamic LoRA adapter selection, enabling you to serve multiple use cases with a single foundation model. This is enabled through a multitier cache system that manages adapters across GPU and host memory, accelerated with special GPU kernels to serve multiple adapters simultaneously.
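
Because NIM exposes an OpenAI-compatible API, adapter selection can be expressed through the model name in the request. In the sketch below, the local endpoint and the adapter name llama-3.1-8b-customer-support are hypothetical placeholders for a NIM deployment with registered LoRA adapters.

```python
# Sketch of calling a self-hosted Llama 3.1 NIM through its OpenAI-compatible API.
# The base_url assumes a NIM running locally on port 8000; the adapter name in
# `model` is a hypothetical placeholder for a LoRA adapter you have deployed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

completion = client.chat.completions.create(
    model="llama-3.1-8b-customer-support",   # selects a specific LoRA adapter
    messages=[{"role": "user", "content": "Where is my last order?"}],
    max_tokens=128,
)
print(completion.choices[0].message.content)
```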

Next steps

With the NVIDIA-accelerated computing platform, you can build models and applications with Llama 3.1 every step of the way, on any platform from the data center to NVIDIA RTX and NVIDIA Jetson.

NVIDIA is committed to advancing, optimizing, and contributing to open-source software and models. Learn more about the NVIDIA AI platform for generative AI.
