UT > Faculties > EEMCS > Disciplines & departments > Formal Methods and Tools > Research > Projects


The Hidden Cost of Inference: A Phase-Level Examination of the Energy Efficiency of LLM Inference

Project summary

With the widespread deployment of large language models (LLMs), the energy cost of inference has become a growing concern. Unlike training, inference occurs continuously in production, where small inefficiencies accumulate into significant energy demands. The inference process can be divided into two distinct phases: prompt processing, which is primarily compute-bound, and token generation, which is largely memory-bound. Reducing overall energy consumption during inference therefore requires a dual approach: optimising computational efficiency in prompt processing and enhancing memory management in token generation. Existing approaches typically either allocate the two inference phases to separate machines optimised for each phase or focus on optimising only one of them. This PhD project aims to investigate the relative contributions of these phases to total energy consumption, identify the dominant energy drivers within each, and explore their interdependencies. Through empirical investigation, it aims to propose software-level optimisation strategies that balance both phases to minimise total energy consumption. By considering factors such as input size, model design choices, and memory allocation patterns, this work will advance our understanding of energy behaviour in LLM inference and guide future energy-efficient deployments.
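The compute-bound versus memory-bound distinction between the two phases can be illustrated with a back-of-the-envelope arithmetic-intensity calculation. The sketch below is illustrative only and not part of the project: the model size, precision, and token counts are assumed values, and the constant of roughly 2 FLOPs per parameter per token is a standard approximation for dense transformer inference.

```python
# Hypothetical 7B-parameter dense model in fp16 (assumed values, not measured).
PARAMS = 7e9
BYTES_PER_PARAM = 2  # fp16 weights


def arithmetic_intensity(tokens_per_pass: float) -> float:
    """Approximate FLOPs per byte of weight traffic for one forward pass.

    Each parameter contributes ~2 FLOPs per token (multiply + add), while
    the weights are read from memory once per pass regardless of how many
    tokens are processed together in that pass.
    """
    flops = 2 * PARAMS * tokens_per_pass
    weight_bytes = PARAMS * BYTES_PER_PARAM
    return flops / weight_bytes


# Prompt processing: hundreds of prompt tokens share a single weight read,
# so the ratio of compute to memory traffic is high -> compute-bound.
prefill_intensity = arithmetic_intensity(512)

# Token generation: one new token per pass re-reads all the weights,
# so the ratio is low -> memory-bound.
decode_intensity = arithmetic_intensity(1)

print(f"prefill: ~{prefill_intensity:.0f} FLOPs/byte")
print(f"decode:  ~{decode_intensity:.0f} FLOPs/byte")
```

Under these assumptions the prefill pass performs hundreds of FLOPs per byte moved while decoding performs only about one, which is why the two phases stress different hardware resources and respond to different optimisation strategies.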

This is a Sector Plan 2 project. 

Duration: 1 October 2024 - 30 September 2028