TY - JOUR
T1 - Revealing the Unseen
T2 - AI Chain on LLMs for Predicting Implicit Dataflows to Generate Dataflow Graphs in Dynamically Typed Code
AU - Huang, Qing
AU - Luo, Zhiwen
AU - Xing, Zhenchang
AU - Zeng, Jinshan
AU - Chen, Jieshan
AU - Xu, Xiwei
AU - Chen, Yong
N1 - Publisher Copyright:
© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2024/9/27
Y1 - 2024/9/27
N2 - Dataflow graphs (DFGs) capture definitions (defs) and uses across program blocks, which is a fundamental program representation for program analysis, testing and maintenance. However, dynamically typed programming languages like Python present implicit dataflow issues that make it challenging to determine def-use flow information at compile time. Static analysis methods like Soot and WALA are inadequate for handling these issues, and manually enumerating comprehensive heuristic rules is impractical. Large pre-trained language models (LLMs) offer a potential solution, as they have powerful language understanding and pattern matching abilities, allowing them to predict implicit dataflow by analyzing code context and relationships between variables, functions, and statements in code. We propose leveraging LLMs' in-context learning ability to learn implicit rules and patterns from code representation and contextual information to solve implicit dataflow problems. To further enhance the accuracy of LLMs, we design a five-step chain of thought (CoT) and break it down into an Artificial Intelligence (AI) chain, with each step corresponding to a separate AI unit to generate accurate DFGs for Python code. Our approach's performance is thoroughly assessed, demonstrating the effectiveness of each AI unit in the AI Chain. Compared to static analysis, our method achieves 82% higher def coverage and 58% higher use coverage in DFG generation on implicit dataflow. We also prove the indispensability of each unit in the AI Chain. Overall, our approach offers a promising direction for building software engineering tools by utilizing foundation models, eliminating significant engineering and maintenance effort, but focusing on identifying problems for AI to solve.
AB - Dataflow graphs (DFGs) capture definitions (defs) and uses across program blocks, which is a fundamental program representation for program analysis, testing and maintenance. However, dynamically typed programming languages like Python present implicit dataflow issues that make it challenging to determine def-use flow information at compile time. Static analysis methods like Soot and WALA are inadequate for handling these issues, and manually enumerating comprehensive heuristic rules is impractical. Large pre-trained language models (LLMs) offer a potential solution, as they have powerful language understanding and pattern matching abilities, allowing them to predict implicit dataflow by analyzing code context and relationships between variables, functions, and statements in code. We propose leveraging LLMs' in-context learning ability to learn implicit rules and patterns from code representation and contextual information to solve implicit dataflow problems. To further enhance the accuracy of LLMs, we design a five-step chain of thought (CoT) and break it down into an Artificial Intelligence (AI) chain, with each step corresponding to a separate AI unit to generate accurate DFGs for Python code. Our approach's performance is thoroughly assessed, demonstrating the effectiveness of each AI unit in the AI Chain. Compared to static analysis, our method achieves 82% higher def coverage and 58% higher use coverage in DFG generation on implicit dataflow. We also prove the indispensability of each unit in the AI Chain. Overall, our approach offers a promising direction for building software engineering tools by utilizing foundation models, eliminating significant engineering and maintenance effort, but focusing on identifying problems for AI to solve.
KW - AI chain
KW - Dataflow graph
KW - Large Language Models
UR - http://www.scopus.com/inward/record.url?scp=85206219567&partnerID=8YFLogxK
U2 - 10.1145/3672458
DO - 10.1145/3672458
M3 - Article
AN - SCOPUS:85206219567
SN - 1049-331X
VL - 33
JO - ACM Transactions on Software Engineering and Methodology
JF - ACM Transactions on Software Engineering and Methodology
IS - 7
M1 - 183
ER -