2:4 Sparse Llama FP8: SOTA performance for NVIDIA Hopper GPUs

Introducing 2:4 Sparse Llama with FP8

December 18, 2024
Alexandre Marques, Eldar Kurtić, Mark Kurtz, Dan Alistarh, Shubhra Pandit, Faraz Shahsavan
Related topics:
Artificial intelligence
Related products:
Red Hat AI

    A sparse summary

    • Hardware-accelerated sparsity: Achieves an average of 30% lower latency and 20% higher throughput from sparsity alone on NVIDIA Hopper GPUs.
    • FP8 quantization compatible: Supports NVIDIA's FP8 format with sparsity, enabling an average of 1.7X lower latency and 1.5X faster throughput.
    • Open source with vLLM: Built into vLLM with custom CUTLASS-based sparse, FP8 kernels for further adoption and development.

    Advancing AI efficiency is more critical than ever, and sparsity has proven to be a cornerstone in this pursuit. Building on our previous work at Neural Magic with the 2:4 Sparse Llama 3.1 8B foundation model, which improves model efficiency by eliminating unnecessary parameters while preserving accuracy, we are excited to introduce the next step forward: sparse 8-bit floating point (FP8) models and the associated high-performance kernels for vLLM.

    FP8 precision, the latest hardware-supported quantization format on NVIDIA GPUs, delivers compute and memory reductions comparable to 8-bit integer (INT8) formats: 2X faster compute and 2X lower memory usage relative to 16-bit formats. The difference is that FP8's floating-point representation captures outliers within the model better than INT8, enabling easier and more accurate quantization. By combining FP8 with the advantages of the 2:4 sparsity pattern and CUTLASS-based performance kernels in vLLM, we achieve optimal hardware utilization and state-of-the-art performance on NVIDIA's Hopper architecture. This integration unlocks new levels of efficiency: a total of 1.7X lower latency and 1.5X more queries per second for throughput, with full accuracy recovery.
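    To make the outlier point concrete, here is a minimal numpy sketch, a simplified simulation rather than real hardware FP8, comparing symmetric per-tensor INT8 rounding with E4M3-style rounding on a tensor containing one outlier (the tensor values are illustrative assumptions):

```python
import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor INT8: a single scale for the whole tensor,
    # so one large outlier stretches the scale and crushes small values.
    scale = np.abs(x).max() / 127.0
    return np.clip(np.round(x / scale), -127, 127) * scale

def quantize_fp8_e4m3_sim(x):
    # Simplified simulation of FP8 E4M3 rounding: 3 mantissa bits give a
    # step of 2**(e - 3) at exponent e, and values clamp at the E4M3 max of 448.
    x = np.clip(np.asarray(x, dtype=np.float64), -448.0, 448.0)
    out = np.zeros_like(x)
    nz = x != 0
    e = np.floor(np.log2(np.abs(x[nz])))
    step = 2.0 ** (e - 3)
    out[nz] = np.round(x[nz] / step) * step
    return out

# A weight vector with one large outlier (illustrative values).
w = np.array([0.01, -0.02, 0.03, 100.0])
err_int8 = np.abs(quantize_int8(w) - w)
err_fp8 = np.abs(quantize_fp8_e4m3_sim(w) - w)
# The INT8 error on the small weights dominates; FP8 keeps them accurate.
```

    Because FP8's step size shrinks with the magnitude of each value, the small weights survive quantization even in the presence of the outlier, which is exactly the property that makes FP8 easier to apply than INT8.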

    Figure 1: Inference performance and accuracy results for dense BF16, sparse BF16, dense FP8, and sparse FP8 versions of Llama 3.1 8B through vLLM on an H100 GPU.
    Figure 2: Server-based inference performance results for a multi-turn chat use case with batch size one at various QPS rates for dense BF16, dense FP8, and sparse FP8 versions of Llama 3.1 8B through vLLM on an H100 GPU.

    Cutting latency with CUTLASS

    The development of high-performance FP8 sparse kernels for vLLM marks a new chapter in inference optimization, delivering state-of-the-art performance on NVIDIA Hopper GPUs. By combining FP8 precision with the 2:4 structured sparsity pattern, we created custom kernels on top of CUTLASS v3.6 (NVIDIA's template library for efficient matrix multiplication) that tackle memory bottlenecks and improve computational efficiency. FP8 cuts memory bandwidth usage in half compared to BF16, while sparsity doubles the theoretical tensor core throughput by skipping redundant computations.
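    The 2:4 pattern means that in every contiguous group of four weights, exactly two are zero, a constraint Hopper tensor cores accelerate in hardware. A minimal numpy sketch of one-shot magnitude pruning to this pattern (illustrative only; the Sparse Llama recipe combines pruning with fine-tuning to recover accuracy):

```python
import numpy as np

def prune_2_4(w):
    # Magnitude-based 2:4 pruning: in every contiguous group of four
    # weights, zero out the two with the smallest magnitude.
    groups = w.reshape(-1, 4).copy()
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    np.put_along_axis(groups, drop, 0.0, axis=1)
    return groups.reshape(w.shape)

w = np.arange(1, 9, dtype=np.float32).reshape(2, 4)
sparse = prune_2_4(w)
# Each group of four keeps only its two largest-magnitude entries.
```

    The fixed 50% structure is what lets the hardware skip the zeroed multiply-accumulates deterministically, unlike unstructured sparsity, which is hard to accelerate.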

    Building on existing FP8 kernel implementations in vLLM, which leverage CUTLASS and the torch.float8_e4m3fn tensor type, we enabled high-performance sparse FP8 support through:

    • Custom sparse FP8 CUTLASS kernels: Optimized to handle sparse FP8 weight matrices with FP8 quantized activations efficiently.
    • Optimization and tuning: Fine-tuning CUTLASS parameters across scenarios to maximize inference performance.

    Matrix multiplication performance benchmarks illustrate the impact of these advancements. Compared to a naive PyTorch BF16 implementation, the FP8 CUTLASS kernels alone achieve up to 1.9X speedups. These gains are further amplified when combined with the 2:4 sparsity pattern, delivering up to 30% lower latency across batch sizes. FP8 precision and sparsity unlock a total potential speedup of 2.5X over BF16 while maintaining consistent performance advantages over dense FP8 implementations, as shown in Figure 3.

    Figure 3: Performance comparison of different matmul kernel implementations on an H100 GPU for a weight matrix of size 4096x28672.

    Accuracy without compromise

    To ensure Sparse FP8 models retain accuracy while delivering inference performance gains and easy-to-apply quantization, we employed a two-part quantization strategy: dynamic per-token FP8 for activations and static per-channel FP8 for weights. This quantization was applied post-training, following fine-tuning processes identical to those outlined in the original 2:4 Sparse Llama blog.
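    A minimal numpy sketch of the two scale computations, assuming a (tokens × features) activation layout and an (output × input) weight layout; these are simplified illustrations, and the actual code paths live in the quantization tooling and vLLM:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in E4M3

def per_token_scales(activations):
    # Dynamic per-token quantization: one scale per row (token),
    # recomputed from the activation values at runtime.
    amax = np.abs(activations).max(axis=1, keepdims=True)
    return amax / FP8_E4M3_MAX

def per_channel_scales(weights):
    # Static per-channel quantization: one scale per output channel,
    # computed once from the weights and fixed at inference time.
    amax = np.abs(weights).max(axis=1, keepdims=True)
    return amax / FP8_E4M3_MAX
```

    Dynamic scales track the activation range of each token as it arrives, while the static weight scales are baked in ahead of time, so no calibration is needed at serving time.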

    The fine-tuning and evaluations were conducted across the same key domains to measure accuracy recovery and robustness:

    • Mathematical reasoning: Fine-tuned on GSM8K, evaluated with strict-match accuracy in a zero-shot setting.
    • Coding tasks: Fine-tuned on Evol-CodeAlpaca, evaluated with pass@1 performance on HumanEval.
    • Conversational AI: Fine-tuned on Ultrachat-200K, evaluated with win rate on AlpacaEval.
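    For reference, pass@1 on HumanEval is conventionally computed with the unbiased pass@k estimator from the original HumanEval evaluation protocol; a short sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    # Unbiased pass@k estimator from the HumanEval evaluation protocol:
    # n = samples generated per problem, c = samples passing the tests.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per problem, pass@1 reduces to the plain pass rate.
```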

    As summarized in Table 1, Sparse FP8 models achieve near-full accuracy recovery, comparable to earlier results observed with INT8 quantization. These findings demonstrate the robustness of FP8 quantization, ensuring maximum compression and performance gains without sacrificing accuracy.

    Table 1: Accuracy evaluations comparing dense BF16, sparse BF16, and sparse FP8 versions of Llama 3.1 8B.

    Efficient inference at scale

    To evaluate the real-world impact of sparse FP8 models, we benchmarked them against dense FP8 and dense BF16 versions. The benchmarks cover scenarios that reflect practical deployments and span a range of prefill and decode sizes, including code completion, docstring generation, instruction following, multi-turn chat, summarization, and long-context retrieval-augmented generation (RAG), as given in Table 2.

    Table 2: Prefill and decode token amounts for various real-world use cases used for benchmarking.

    Single-stream latency results

    To illustrate the latency-critical extreme for inference, we benchmarked the various scenarios in a single-stream setup: batch size one with a single request at a time. Here, sparse FP8 models show an average 1.7X faster inference latency than dense BF16 models, with up to 30% of these gains attributed to sparsity alone, as seen in Table 3.

    Table 3: Inference latencies across various use cases for dense BF16, dense FP8, and sparse FP8 versions of Llama 3.1 8B through vLLM on an H100 GPU with batch size 1 and 1 request at a time.
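    The single-stream setup can be sketched as a simple harness, where `generate` is a hypothetical stand-in for a vLLM request (this is illustrative, not the benchmark code behind these results):

```python
import time

def benchmark_single_stream(generate, prompts):
    # Single-stream setup: batch size one, one request in flight at a
    # time; report the mean end-to-end latency per request in seconds.
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        generate(prompt)  # stand-in for an actual model call
        latencies.append(time.perf_counter() - start)
    return sum(latencies) / len(latencies)
```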

    Multi-stream throughput results

    To illustrate the opposite end of the performance envelope, we benchmarked the various scenarios in a throughput setup: batch size one with all requests issued at once. Here, sparse FP8 models show an average 1.5X increase in queries per second over dense BF16 models, with up to 20% of these gains attributed to sparsity alone, as seen in Table 4.

    Table 4: Throughput inference queries per second across various use cases for dense BF16, dense FP8, and sparse FP8 versions of Llama 3.1 8B through vLLM on an H100 GPU with batch size 1 and all requests at once.
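    The throughput setup can likewise be sketched as a harness that submits every request up front, again with `generate` as a hypothetical stand-in for a model call:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def benchmark_throughput(generate, prompts, max_workers=8):
    # Throughput setup: issue all requests at once and report completed
    # queries per second (QPS) over the whole run.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(generate, prompts))
    elapsed = time.perf_counter() - start
    return len(prompts) / elapsed
```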

    Multi-stream server results

    To evaluate the scalability of Sparse FP8 models in real-world server deployments and ensure the throughput and latency benchmarks align, we present comprehensive results for two key use cases. These benchmarks scale queries per second (QPS) from single-stream to full-throughput conditions while measuring inter-token latency (ITL).

    • Figure 2, introduced earlier in the blog, showcases the performance for multi-turn chat, demonstrating consistent performance gains across a range of QPS rates.
    • Figure 4, below, focuses on code completion, a more decode-heavy workload, where Sparse FP8 models similarly deliver consistent performance improvements across various QPS rates.

    Both figures provide two key perspectives for interpreting the results:

    • Fixed ITL (Inter-Token Latency) as a Service Level Agreement (SLA): By setting a target ITL, the graphs illustrate how Sparse FP8 models increase the number of queries that can be processed concurrently while maintaining the desired performance level.
    • Fixed QPS (Queries Per Second): At a specific QPS rate, the graphs demonstrate improvements in ITL, showcasing faster response times and lower latency.
    Figure 4: Server-based inference performance results for a code completion use case with batch size one at various QPS rates for dense BF16, dense FP8, and sparse FP8 versions of Llama 3.1 8B through vLLM on an H100 GPU.
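    Reading the curves under a fixed-ITL SLA amounts to picking the highest QPS whose measured ITL stays within the target; a trivial sketch with hypothetical measurement pairs:

```python
def max_qps_under_sla(measurements, itl_sla_ms):
    # Given (qps, mean_itl_ms) pairs from a server benchmark sweep,
    # return the highest QPS whose inter-token latency meets the SLA.
    feasible = [qps for qps, itl in measurements if itl <= itl_sla_ms]
    return max(feasible) if feasible else None

# Hypothetical sweep: ITL grows as the server is pushed harder.
points = [(1, 12.0), (2, 14.0), (4, 19.0), (8, 35.0)]
# With a 20 ms ITL SLA, this sweep supports at most 4 QPS.
```

    A model that lowers the whole ITL curve, as Sparse FP8 does here, therefore raises the maximum QPS sustainable under any given SLA.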

    Unlock efficiency

    Sparse FP8 models enable exceptional performance, scalability, and cost-effectiveness on NVIDIA Hopper GPUs. By reducing memory bandwidth demands, maximizing tensor core throughput, and maintaining full accuracy recovery, they deliver faster, more efficient AI deployments without compromising quality.

    Neural Magic is proud to continue its commitment to the open-source community, empowering developers, researchers, and enterprises to adopt and build upon these innovations. Our open source FP8 models and high-performance kernels for vLLM are designed to simplify integration and experimentation for real-world use cases.

    Looking to get started in open source?

    • Explore Sparse FP8 models on Hugging Face.
    • Access our FP8 kernels on GitHub within vLLM.
    Last updated: September 18, 2025
