A tree-based representation of source code structure used in program analysis.
An Abstract Syntax Tree (AST) is a hierarchical data structure that represents the syntactic structure of source code, where each node corresponds to a construct in the programming language — such as expressions, statements, or declarations. Unlike a raw parse tree, an AST strips away superficial syntactic details like punctuation and grouping symbols, retaining only the semantically meaningful structure of the program. This makes it a compact and manipulable representation that compilers, interpreters, and static analysis tools can operate on efficiently.
ASTs are constructed during the parsing phase of compilation or interpretation. A lexer first tokenizes raw source text, and a parser then organizes those tokens into the tree according to the grammar of the language. Once built, the AST serves as the primary intermediate representation for subsequent stages: semantic analysis checks type correctness and scope, optimization passes restructure the tree for efficiency, and code generation traverses it to emit machine code or bytecode. The tree structure makes recursive algorithms natural and powerful for these tasks.
In machine learning and AI, ASTs have become increasingly important as researchers tackle problems involving code understanding and generation. Models trained on source code — such as code completion systems, bug detectors, and program synthesizers — often consume ASTs rather than raw text, because the tree structure encodes syntactic relationships that flat token sequences obscure. Graph neural networks applied to ASTs can learn representations that respect the hierarchical nature of programs, improving performance on tasks like vulnerability detection, code summarization, and automated refactoring.
Program synthesis, a subfield of AI concerned with automatically generating programs from specifications or examples, relies heavily on ASTs as the search space over which synthesis algorithms operate. Techniques like neural-guided search and differentiable programming use AST representations to constrain outputs to syntactically valid programs, dramatically reducing the search complexity. As large language models are increasingly applied to coding tasks, hybrid approaches that combine neural generation with AST-based validity checking have emerged as a promising direction for reliable and controllable code generation.