
SW/HW Co-optimization Strategy for LLMs – Part 2 (Software)

Software is eating the world. What does the software landscape of LLMs look like, and which emerging libraries and frameworks improve LLM performance?

With a continual influx of new LLM models and features (check out the Hugging Face LLM leaderboard), software tools and libraries are being released at an accelerated rate. This rapid progression is also sparking numerous innovations in AI hardware. When optimizing LLMs from a system perspective, it’s crucial to understand that while new research emerges daily from major companies and research institutes such as Meta, Google, OpenAI, Nvidia, and Stanford, the software stack and libraries can’t immediately translate all of it into hardware for execution. Only a small, selective set of software features makes it into production, typically after several months (~6 months) of development. If those features also need support in an AI hardware accelerator, any architectural change demands an even longer development cycle (2–4 years). Addressing this discrepancy between software and hardware optimization for LLMs poses a significant challenge, one that we aim to tackle in this series of posts!

Image by Author

Emerging software tools and libraries cater to both LLM training and inferencing. In this post, our focus will be specifically on LLM deployment and an in-depth exploration of how these tools enhance LLM performance. In an upcoming post, we’ll delve into LLM training software like DeepSpeed, FairScale, Colossal-AI, and more.

Previously, I discussed enhancements to LLM models and highlighted new research features. You can refresh your memory by revisiting that discussion below:

SW/HW Co-optimization Strategy for Large Language Models (LLMs)

LLMs, as a specialized domain of AI models, rely on the conventional AI stack to convert models into machine code for execution on AI hardware. Different hardware companies offer their own software stacks to facilitate AI inference. Below, I’ll showcase three prominent hardware vendors (Nvidia, AMD, Intel) and their corresponding software platforms:

Image by Author

Traditional AI SW stack

As shown in the table above, Nvidia leads the generative AI landscape with its proprietary CUDA software ecosystem. Offering a robust suite of tools and libraries such as cuDNN and cuBLAS, Nvidia accelerates top AI use cases on its GPUs. Its recent release, TensorRT-LLM, introduces a rich set of LLM optimizations such as continuous (in-flight) batching, paged KV attention (the technique popularized by vLLM), and tensor parallelism. AMD focuses on ROCm to bolster its AI hardware, the Instinct MI200/MI300 series. Meanwhile, Intel champions the oneAPI, oneDNN, and OpenVINO APIs and toolchains to support AI models across Intel’s CPU, GPU, and NPU platforms, aiming for unified software and an open ecosystem across AI hardware.

Adopting LLMs onto a conventional AI software stack starts with enabling the fundamental functions and operators. Most operators are typically supported because LLMs rely on the transformer architecture, with its encoder/decoder blocks. However, certain newer operators, such as positional encodings, may require specific attention. Taking Torch-TensorRT as an example, unsupported operators fall back to PyTorch: the graph is segmented into gray regions (executed in PyTorch) and green regions (executed in TensorRT), as shown in the picture below. An excessive number of operators in the gray region indicates poor performance.

import torch_tensorrt

# Convert the PyTorch model to TensorRT via the FX frontend;
# unsupported operators fall back to native PyTorch execution.
trt_module = torch_tensorrt.compile(
    model,
    ir="fx",
    inputs=[torch_tensorrt.Input(x.shape)],  # example input spec; other settings omitted
)

# Run inference with the compiled module
trt_module(x)
Source: https://pytorch.org/

Nvidia’s TensorRT supports many optimizations for DL models: layer and tensor fusion, which merges multiple operations or layers into a single kernel to reduce memory-access frequency and improve performance; kernel auto-tuning, which selects the best algorithms and batch sizes for the target hardware; and mixed precision, which converts FP32 datatypes to FP16/INT8 for faster inference.
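As a hedged illustration of the mixed-precision path, the sketch below extends the earlier Torch-TensorRT example so the kernel auto-tuner may also consider FP16 kernels; `model` and `x` remain placeholders, and exact arguments vary by Torch-TensorRT version:

import torch
import torch_tensorrt

# Sketch: allow TensorRT to pick, per layer, the fastest kernel among
# the listed precisions (FP32 or FP16) during auto-tuning.
trt_fp16 = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input(x.shape, dtype=torch.half)],
    enabled_precisions={torch.float32, torch.half},
)
output = trt_fp16(x.half())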

Each company has its own implementation of these features, which are common across various AI models and not specific to LLMs. Now let’s dive into LLM-specific SW.


LLM acceleration SW frameworks and libraries

The conventional AI software stack falls short in optimizing LLMs due to their high computational and memory demands. Several open-source software frameworks and libraries have emerged to accelerate LLM inferencing, catering to developer needs. I’ve compiled a list of popular ones in the table below, with reference links (Ref [1] vLLM; Ref [2] StreamingLLM; Ref [3] FlexGen; Ref [4] OpenLLM; Ref [5] DeepSpeed).

Table summarized by Author

The top four frameworks, developed by major corporations, offer extensive feature sets compared to the bottom three, which originate from universities and are tailored to specific features. For instance, vLLM initially focused on paged KV attention and gradually expanded its support for additional features. Bold-highlighted features denote the most commonly supported ones across the LLM frameworks. Let’s delve into some of these features below:

1. Continuous batching (20x+ throughput)

During batched generation, some sequences finish generating tokens earlier than others, leaving their slots idle until all sequences in the batch conclude. Continuous batching addresses this by inserting new requests into the freed slots as soon as a sequence finishes, leveraging attention masking to shield them from previous sequences and prevent interference.

Example of four sequences using continuous batching. Source: How continuous batching enables 23x throughput in LLM inference while reducing p50 latency (Ref [6])

The figure above shows the basic principle of continuous batching. With traditional static batching, each sequence ends at a different time step, resulting in GPU underutilization. In this specific example, sequence S3 ends at T5 and wastes GPU resources at T6, T7, and T8. Continuous batching solves this: once an end token is detected, a new sequence (S5, S6, or S7) is inserted to keep the GPU fully utilized.
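To make the scheduling idea concrete, here is a minimal, framework-agnostic Python sketch of a continuous-batching loop. The token generator is a dummy stand-in for one decoding step of a real inference engine, and all names are illustrative:

import random
from collections import deque

EOS = "<eos>"
VOCAB = ["the", "cat", "sat", EOS]   # dummy vocabulary
MAX_BATCH_SIZE = 4                   # number of GPU batch slots

def generate_next_token(sequence):
    # Dummy stand-in for one decoding step of a real LLM engine.
    return random.choice(VOCAB)

def continuous_batching(prompts):
    waiting = deque(prompts)   # requests not yet scheduled
    active = {}                # batch slot -> running sequence
    finished = []
    while waiting or active:
        # Fill freed slots immediately: new sequences join mid-flight
        # instead of waiting for the whole batch to drain.
        while waiting and len(active) < MAX_BATCH_SIZE:
            slot = next(i for i in range(MAX_BATCH_SIZE) if i not in active)
            active[slot] = [waiting.popleft()]
        # One (conceptually batched) decoding step over all active sequences.
        for slot, seq in list(active.items()):
            token = generate_next_token(seq)
            seq.append(token)
            if token == EOS:   # sequence done -> free its slot right away
                finished.append(active.pop(slot))
    return finished

print(continuous_batching([f"prompt-{i}" for i in range(7)]))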

2. Model parallelism (tensor and pipeline parallelism)

Tensor parallelism involves dividing a tensor into multiple sub-tensors, with each device holding one sub-tensor and performing its share of the computation. The resulting partial outcomes are then combined to produce the final result. In the figure below (Ref [7]), tensor B is split vertically (column-wise) into two parts, and matrix A is multiplied with each segment. Each multiplication occurs on a separate device, and the resulting partial outputs C are then concatenated to derive the final result.

Tensor parallel illustration. Source: Colossal-AI (Ref [7])
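Here is a minimal PyTorch sketch of the column-wise split described above, using two CPU tensors to stand in for two devices (a real implementation would place each shard on its own GPU and gather the partial outputs over the interconnect):

import torch

A = torch.randn(4, 8)           # input activations
B = torch.randn(8, 6)           # weight matrix to be sharded

# Split B column-wise across two "devices" (here, just two tensors).
B0, B1 = B.chunk(2, dim=1)      # each shard is 8 x 3

# Each device computes a partial result with its own shard ...
C0 = A @ B0                     # 4 x 3, on device 0
C1 = A @ B1                     # 4 x 3, on device 1

# ... and the partial outputs are concatenated to form the full result.
C = torch.cat([C0, C1], dim=1)  # 4 x 6

assert torch.allclose(C, A @ B)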

Pipeline parallelism involves segmenting the model into multiple chunks of layers and assigning each chunk to a separate device (GPU). During the forward pass, intermediate activations (and, during training, backward-pass gradients) are transferred to the next device for further processing until the final output is reached. This method utilizes multiple devices simultaneously to enhance throughput, but it requires fast and seamless communication between the devices.

Pipeline parallel illustration. Source: Colossal-AI (Ref [7])
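A minimal sketch of a two-stage pipeline in PyTorch is shown below; the toy layers are illustrative and the `cuda:0`/`cuda:1` placement assumes two GPUs are available:

import torch
import torch.nn as nn

# Toy model split by layers into two pipeline stages, one per device.
stage0 = nn.Sequential(nn.Linear(16, 32), nn.ReLU()).to("cuda:0")
stage1 = nn.Sequential(nn.Linear(32, 8)).to("cuda:1")

x = torch.randn(4, 16, device="cuda:0")

# Forward pass: intermediate activations hop from device 0 to device 1.
h = stage0(x)
y = stage1(h.to("cuda:1"))

In practice, pipeline schedules such as GPipe split each batch into micro-batches so that both stages stay busy rather than idling while the other computes.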

3. FlexGen (Ref [8])

FlexGen introduces an offloading strategy aimed at resource-constrained computing platforms with limited memory capacity. It optimizes memory and compute resources by leveraging CPU, GPU, and disk together, identifying efficient tensor storage and access patterns.

Two different schedules demonstrating FlexGen strategy. Source: FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU (Ref [8])
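The core idea can be illustrated with a simple, hypothetical layer-by-layer offloading loop in PyTorch that keeps weights on the CPU and moves only the active layer to the GPU. FlexGen’s actual scheduler is far more sophisticated: it also uses disk, overlaps I/O with compute, and searches over schedules, as the figure above suggests:

import torch
import torch.nn as nn

# Toy "large" model whose weights live on the CPU (they could also be disk-backed).
layers = [nn.Linear(1024, 1024) for _ in range(8)]

def offloaded_forward(x, layers, device="cuda"):
    # Assumes a CUDA device is available.
    x = x.to(device)
    for layer in layers:
        layer.to(device)   # fetch this layer's weights just in time
        x = layer(x)
        layer.to("cpu")    # evict to free GPU memory for the next layer
    return x

y = offloaded_forward(torch.randn(4, 1024), layers)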

Compared to two other offloading-based frameworks (DeepSpeed ZeRO-Inference and Hugging Face Accelerate), FlexGen offers significantly higher throughput (figure below).

100× higher maximum throughput improvement with FlexGen. Source: FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU (Ref [8])

Beyond the features mentioned above, new technologies are being developed at a fast pace. One notable example: Apple recently released an offloading strategy that leverages flash memory to accelerate LLMs (Ref [9]), enabling models up to twice the size of the available DRAM to run, with a 4–5x inference speedup on CPU and a 20–25x speedup on GPU compared to naive loading approaches.


Key message

I previously discussed the fundamental concepts of paged attention and quantization in my earlier post. With new research rapidly emerging on LLM models and acceleration techniques, various organizations are developing the underlying software to support them. It’s crucial for organizations and developers to carefully choose the most suitable options based on their requirements. A robust and efficient software stack is essential to implement these acceleration techniques effectively and maximize AI hardware resources.

In my upcoming post, I’ll dive into advanced AI hardware and memory technologies that accelerate LLMs. Stay tuned!

Reference

[1] vLLM: https://github.com/vllm-project/vllm

[2] StreamingLLM: https://github.com/mit-han-lab/streaming-llm

[3] FlexGen: https://github.com/FMInference/FlexGen#how-it-works

[4] OpenLLM: https://github.com/bentoml/OpenLLM

[5] Microsoft Research Blog: DeepSpeed: Accelerating large-scale model inference and training via system optimizations and compression

[6] Cade Daniel, Chen Shen, Eric Liang and Richard Liaw, How continuous batching enables 23x throughput in LLM inference while reducing p50 latency, Anyscale Blog, June 2023

[7] Shenggui Li, Siqi Mai, Paradigms of Parallelism, Colossal-AI Concepts

[8] Ying Sheng et al., FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU, arXiv, June 2023

[9] Keivan Alizadeh et al., LLM in a flash: Efficient Large Language Model Inference with Limited Memory, arXiv, December 2023

