PolyUFC: Polyhedral Compilation Meets Roofline Analysis for Uncore Frequency Capping

Nilesh Rajendra Shah, M V V S Manoj Kumar, Dhairya Baxi, and Ramakrishna Upadrasta

Published in CGO’26

POLYUFC, an MLIR based compilation flow for uncore frequency capping that combines (performance and power) roofline analyses and polyhedral compilation-based static analysis for characterization of affine programs. We introduce a parametric mathematical model that links operational intensity and uncore frequency to derive frequency caps, validated through empirical evaluation on real hardware. By embedding these caps into Pluto optimized code generated by Polygeist, we achieve improvements in Energy Delay Product (EDP) up to 42% on compute-bound, and up to 54% on bandwidth-bound programs—carefully selected from ML-models from vision/NLP domains and POLYBENCH—over Intel UFS driver. Our framework is retargetable across multiple micro-architectures; and can handle multiple optimization goals like performance, energy and EDP,and is applicable across inter/intra dialects.

Performance/Power Roofline based Characterization

We evaluate the value of Operational Intensity (OI) against both performance and power roofline boundaries to determine the bound and bottleneck characteristics of the program. So, the I metric provides crucial insights into the compute and bandwidth characteristics of the program. Based on the relationship between OI and the time machine balance, programs are classified as follows:

Compute-Bound (CB): OI >= time balance
Bandwidth-Bound (BB): OI < time balance

We propose a mathematical model that estimates performance, bandwidth and power, with uncore frequency cap ($f_c$), and Operational Intensity ($OI$ or $I$) as parameters under the given performance and power roofs. We estimate the performance and bandwidth for affine programs on a target architecture by decomposing total execution time ($T$) into computation time ($T_Ω$) and memory transfer time ($T_Q$); each are parameterized by uncore frequency cap ($f_c$) and Operational Intensity ($I$). Computation time $T_Ω$ is based on total FLOPs and FPU throughput at a fixed core frequency. And, time taken by data movement $T_Q$ is derived from cache analysis; this accounts for both cache hits and misses to accurately estimate data movement costs between the processor core and DRAM.

PolyUFC-CM: An Approximate Cache Analysis for Set-associative Caches

We model each cache set as a fully-associative cache. First, we compute reuse-pairs for each set and model all the conflict/capacity misses using the constraint on cache associativity for a cache level. To model sharing of working set sizes across threads, the total cache misses (of any category) are approximated by dividing the sequential miss-count by the number of OpenMP program threads. This simple heuristic provides a first-order approximation, though it could miss inter-thread conflict and coherence misses.

Example:

Uncore Frequency Capping per Loop-nest and Phase Change

Analysis is fine-grain at affine IR with statement level characterization, and Application is at loop-nest level at affine/linalg IR. Based on the characterization, objective (EDP, Performance, Energy) and tuning parameters, searching is done using linear/binary search. For each CB loop-nest, we take minimum of uncore frequency caps obtained across all the statements. For each BB loop-nest, we take maximum of uncore frequency caps obtained across all the statements. For ML kernels (like conv2d and lm-head-matmul), we choose linalgOp-level granularity for frequency capping.

sdpa from bert:

Go to Supplementary Page:

PolyUFC

Funding

This work is supported by the Prime Minister’s Research Fellowship (PMRF) programme, Government of India, with additional funding from faculty grants provided by AMD, and Qualcomm.