IR2Vec: LLVM IR based Scalable Program Embeddings

S. VenkataKeerthy, Rohit Aggarwal, Shalini Jain, Maunendra Desarkar, Ramakrishna Upadrasta and Y. N. Srikant

Published in ACM TACO (arXiv) · IR2Vec GitHub · IR2Vec Video

IR2Vec is an LLVM-IR-based framework that generates distributed representations of source code in an unsupervised manner. These representations can be used as inputs to machine-learning tasks that operate on programs, and they capture intrinsic characteristics of the program. This is achieved by using flow-analysis information, such as Use-Def, Reaching Definitions, and Live Variable information of the program.

The entities of the IR and the relationships among them are modeled, and their representations are learned to form a seed embedding vocabulary. For this, we create a Knowledge Graph by modeling the LLVM IR of the program as entities and relations. A representation-learning algorithm is then used to learn the embeddings of these entities. The resulting embeddings exhibit semantic relationships and form clusters that demonstrate them.
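To make the entity/relation modeling concrete, here is a minimal sketch of how LLVM IR instructions could be turned into knowledge-graph triples. The relation names (`TypeOf`, `NextInst`, `Arg1`, …) and the generic operand kinds (`VAR`, `PTR`) follow the paper's description, but this toy parser over tuples is purely illustrative — the actual IR2Vec pipeline works on real LLVM IR, not on these simplified inputs.

```python
# Illustrative sketch (not the actual IR2Vec implementation): build
# (head, relation, tail) triples from a simplified view of LLVM IR, where
# each instruction is given as (opcode, result type, [operand kinds]).

def ir_to_triples(instructions):
    """Return knowledge-graph triples for a list of simplified instructions."""
    triples = []
    for i, (opcode, ty, args) in enumerate(instructions):
        triples.append((opcode, "TypeOf", ty))            # opcode -> result type
        if i + 1 < len(instructions):
            nxt_opcode = instructions[i + 1][0]
            triples.append((opcode, "NextInst", nxt_opcode))  # sequential order
        for k, arg in enumerate(args, start=1):
            triples.append((opcode, f"Arg{k}", arg))      # operand kinds
    return triples

# Toy basic block: %a = add i32 %b, %c ; store i32 %a, i32* %p
toy = [("add", "i32", ["VAR", "VAR"]),
       ("store", "void", ["VAR", "PTR"])]
print(ir_to_triples(toy))
```

A representation-learning algorithm (the paper uses a translational model over such triples) can then learn one vector per entity, giving the seed embedding vocabulary.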

These seed embeddings are annotated with the flow information to capture the semantics of the program, and then propagated. Vectors representing programs at various levels (instruction, function, module) can be formed depending on the application.
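The hierarchical composition can be sketched as follows. The seed vectors and the scalar weights below are made-up illustrative values, and the flow-based refinement step is omitted entirely — this only shows how instruction vectors could be combined from seed embeddings and then summed up to function and module level.

```python
# Illustrative sketch of hierarchical vector composition (seed vectors,
# weights, and the omitted flow-based propagation are assumptions; they do
# not reproduce IR2Vec's actual values).
import numpy as np

# Hypothetical 2-D seed embeddings for a tiny vocabulary.
seed = {
    "add": np.array([1.0, 0.0]),
    "i32": np.array([0.0, 1.0]),
    "VAR": np.array([0.5, 0.5]),
}

W_OP, W_TY, W_ARG = 1.0, 0.5, 0.2  # illustrative scalar weights

def instruction_vec(opcode, ty, args):
    """Weighted sum of the seed embeddings of an instruction's entities."""
    v = W_OP * seed[opcode] + W_TY * seed[ty]
    for a in args:
        v = v + W_ARG * seed[a]
    return v

def function_vec(instructions):
    """Function vector = sum of its instruction vectors."""
    return sum(instruction_vec(*ins) for ins in instructions)

def module_vec(functions):
    """Module vector = sum of its function vectors."""
    return sum(function_vec(f) for f in functions)

f1 = [("add", "i32", ["VAR", "VAR"])]
print(module_vec([f1]))
```

Because each level is just an aggregation of the level below, the same seed vocabulary serves instruction-, function-, and module-level tasks.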

Performance

We demonstrate the effectiveness of the embeddings on two different tasks:

  • Heterogeneous Device Mapping
  • Prediction of Thread Coarsening Factor

On both tasks, we achieve better accuracies and speed-ups than the state-of-the-art methods, with very few cases of slow-downs.

Other Salient Features

  • Compatibility with non-sequential models
    • Orders-of-magnitude faster training time (up to 8000x reduction)
    • No need for data-hungry models
  • No Out-of-Vocabulary (OOV) issues
    • Sufficient to expose a training space of O(|opcodes| + |types| + |arguments|)
      • compared to a training space of O(|opcodes| x |types| x |arguments|)
    • Maintains a very small vocabulary containing the embeddings of only these entities (opcodes, types, and arguments)
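A back-of-the-envelope comparison shows why the additive training space stays small. The entity counts below are illustrative placeholders, not the actual sizes of the IR2Vec vocabulary:

```python
# Illustrative comparison of the two training-space sizes: the counts of
# opcodes, types, and argument kinds are assumptions, not IR2Vec's real ones.
opcodes, types, arguments = 60, 15, 8

additive = opcodes + types + arguments        # O(|opcodes| + |types| + |arguments|)
multiplicative = opcodes * types * arguments  # O(|opcodes| x |types| x |arguments|)

print(additive, multiplicative)  # → 83 7200
```

Even with these modest counts, learning one embedding per entity (83) rather than one per combination (7200) keeps the vocabulary small and removes OOV combinations by construction.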

Artifacts


Code and other artifacts are available in our GitHub page.

Publications based on IR2Vec

1. The Next 700 ML-Enabled Compiler Optimizations    
     S. VenkataKeerthy, Siddharth Jain, Umesh Kalvakuntla, Pranav Sai Gorantla, Rajiv Shailesh Chitale, Eugene Brevdo, Albert Cohen, Mircea Trofin, Ramakrishna Upadrasta

2. ML-LLVM-Tools: Towards Seamless Integration of Machine Learning in Compiler Optimizations    
     Siddharth Jain, S. VenkataKeerthy, Umesh Kalvakuntla, Albert Cohen, Ramakrishna Upadrasta

3. RL4ReAl: Reinforcement Learning for Register Allocation    
     S. VenkataKeerthy, Siddharth Jain, Anilava Kundu, Rohit Aggarwal, Albert Cohen, and Ramakrishna Upadrasta

4. Reinforcement Learning assisted Loop Distribution for Locality and Vectorization    
     Shalini Jain, S. VenkataKeerthy, Rohit Aggarwal, Tharun Kumar Dangeti, Dibyendu Das, Ramakrishna Upadrasta

5. Packet Processing Algorithm Identification using Program Embeddings    
     S. VenkataKeerthy, Yashas Andaluri, Sayan Dey, Rinku Shah, Praveen Tammana, Ramakrishna Upadrasta

6. POSET-RL: Phase ordering for Optimizing Size and Execution Time using Reinforcement Learning    
     Shalini Jain, Yashas Andaluri, S. VenkataKeerthy, Ramakrishna Upadrasta

Funding

This research is funded by the Department of Electronics & Information Technology and the Ministry of Communications & Information Technology, Government of India. This work is partially supported by a Visvesvaraya PhD Scheme under the MEITY, GoI (PhD-MLA/04(02)/2015-16), an NSM research grant (MeitY/R&D/HPC/2(1)/2014), a Visvesvaraya Young Faculty Research Fellowship from MeitY, and a faculty research grant from AMD.

What do you think?

If you have any comments or questions, feel free to reach out to us at ir2vec-developers@googlegroups.com.
