Notes on Static Analysis and Abstract Interpretation - Intro

A few months after I bought the book Introduction to Static Analysis: An Abstract Interpretation Approach, I finally decided to skim through it to understand a bit more about program analysis. Besides this book, I also regularly look into Guannan's Blog for inspiration.

This note follows the structure of the book.

1. Program Analysis

1.1 What is program analysis? What are the challenges?

Program analysis is analogous to the analysis we perform on many other systems we are familiar with. For example, when building a bridge, we need to examine how external forces from nature affect the design. Similar specifications are needed to build useful software that interacts with the outside world.

To understand systems in other engineering disciplines, we often need to understand the laws of nature that govern them. For computer software, however, it is the computer that gives the program its meaning. The formal definition of a program's runtime behavior is called its semantics.

The biggest beneficiaries of program analysis tools might be programmers and developers, who can use them for quality assurance. Software that takes other programs as input (e.g., compilers and interpreters) benefits from program analysis as well.

Program analysis techniques are not bound to runnable software only. Analyzing high-level system configuration files can provide more information to developers as well. For example, ConfigV analyzes configuration files to learn their specifications through association rule learning.

Despite all the benefits of program analysis, building efficient tools for real-world software is not easy (thank god this keeps my job). One of the great challenges in program analysis is that, as software grows with more and more functionality, the space of possible inputs and program states can easily become "greater than the number of atoms". Therefore, one important factor in program analysis is cost-effectiveness.

1.2 Program Analysis Concepts

To produce useful results, program analyzers set target properties to compute. Such properties include safety properties, liveness properties, and information flow properties. Each property requires a different level of reasoning. For example, a safety property can be refuted by examining a finite execution, while a liveness property also requires reasoning about infinite executions.
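To make the distinction concrete, here is a minimal sketch (the functions and property names are my own illustrations, not from the book). A safety violation shows up in a finite execution prefix: one concrete run reaching the bad state refutes the property. A liveness claim such as termination can never be refuted by a finite prefix, since every finite trace is consistent with the program finishing one step later.

```python
# Safety property: "the divisor is never zero at the division".
# A single finite run reaching count == 0 refutes it.
def average(total, count):
    return total / count  # violation reachable when count == 0

# Liveness property: "this loop eventually terminates".
# No finite prefix refutes it; reasoning about infinite
# executions is needed (e.g. the loop never ends if n < 0).
def countdown(n):
    while n != 0:
        n -= 1
    return "done"
```

Calling `average(1, 0)` produces the finite counterexample trace for the safety property, while diagnosing `countdown(-1)` requires arguing about an infinite execution.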

Another important characteristic that divides program analysis tools into two (almost) distinct families is when the analysis happens. Static analysis happens before program execution, and dynamic analysis happens during execution. One prominent example of static analysis is type checking. For statically typed languages, type checking often happens during the compilation phase and can prevent various undesired behaviors before execution. For dynamic analysis, a widely used practice is user assertions: user-provided assertions check a condition on the program state at runtime.
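A small sketch of the two styles side by side (the functions here are hypothetical examples of mine, not from the book). A static checker such as mypy can reject an ill-typed call before the program ever runs, while an assertion only checks the executions that actually happen:

```python
def fahrenheit_to_celsius(f: float) -> float:
    # Static analysis: a type checker (e.g. mypy) can reject a call
    # like fahrenheit_to_celsius("cold") before execution, using only
    # the annotations.
    return (f - 32) * 5 / 9

def withdraw(balance: float, amount: float) -> float:
    # Dynamic analysis: the assertion checks a property of the concrete
    # program state, but only on the runs that actually reach it.
    assert 0 <= amount <= balance, "invalid withdrawal"
    return balance - amount
```

The trade-off is visible even at this scale: the static check covers all executions but only for properties expressible in the type system, while the assertion can state an arbitrary condition but misses any buggy run that never happens during testing.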

Uncomputability

An ideal program analysis tool would always return the desired property for any given program. However, this is, in general, impossible. The canonical example to consider is the halting problem: given a programming language, decide whether an arbitrary program in that language finishes running or continues to run forever. Alonzo Church and Alan Turing independently proved in 1936 that the halting problem is undecidable, i.e., there is no algorithm or program that always produces the correct yes-or-no answer for every program.
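The classic diagonalization argument behind this result can be sketched in a few lines of Python. The decider `halts` below is hypothetical: the whole point is that no total, always-correct implementation of it can exist, because `paradox` would contradict it either way.

```python
def halts(program, argument):
    # Hypothetical decider: returns True iff program(argument) halts.
    # Turing's argument shows no correct implementation can exist.
    raise NotImplementedError

def paradox(program):
    # Suppose halts were correct. Then:
    #   if paradox(paradox) halts, halts(paradox, paradox) is True,
    #     so paradox loops forever -- contradiction;
    #   if paradox(paradox) loops, halts(paradox, paradox) is False,
    #     so paradox returns -- contradiction.
    if halts(program, program):
        while True:
            pass
    return "halted"
```

Since both cases are contradictory, the assumed decider `halts` cannot exist.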

For someone with experience in the computing field, the above example might not be as surprising. In fact, the uncomputability of computer programs goes way beyond termination - any non-trivial semantic property is also not computable. Setting aside the more formal definition, non-trivial semantic properties are sets of Turing machines that are neither empty nor contain every Turing machine. In terms of programs, semantic properties are associated with the execution of programs, and non-triviality simply means that some, but not all, programs satisfy the property.

Formally, Rice's theorem says: let \(\mathbb{L}\) be a Turing-complete language, and let \(\mathscr{P}\) be a non-trivial semantic property of programs of \(\mathbb{L}\). There exists no algorithm such that, for every program \(\text{p} \in \mathbb{L}\), it returns \(\textbf{true}\) if and only if \(\text{p}\) satisfies the semantic property \(\mathscr{P}\).
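The standard proof reduces the halting problem to deciding the property, and the reduction itself fits in a short sketch (all names here, including `has_property` and `good`, are hypothetical illustrations). Suppose some program `good` satisfies the property while a program that never returns does not; then a decider for the property would decide halting:

```python
def reduce_halting_to_property(p, x, has_property):
    # Sketch of the reduction behind Rice's theorem. Suppose
    # has_property decides a non-trivial semantic property that
    # `good` satisfies and a never-returning program does not.
    def good():
        return 42  # any program satisfying the property

    def q(_):
        p(x)           # runs forever if p(x) does not halt...
        return good()  # ...otherwise q behaves exactly like good

    # q is semantically equivalent to good iff p(x) halts, so
    # has_property(q) would answer "does p halt on x?" --
    # contradicting the undecidability of the halting problem.
    return has_property(q)
```

Note that `q` is only constructed, never run: the decider is supposed to answer from the program text alone, which is exactly what Rice's theorem rules out.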

Given Rice's theorem, we should give up on finding an ideal program analysis that fully automatically determines whether a program satisfies interesting properties like the absence of certain run-time errors, the absence of undesired information flows, or functional correctness. However, this says nothing against useful program analysis tools. It only reminds us that we need to make trade-offs, like giving up some automation or targeting more domain-specific programs (because handling all programs is not feasible due to the above theorem).
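Another trade-off, the one the book's abstract interpretation approach takes, is giving up precision: compute over a small abstract domain instead of concrete values, so the analysis always terminates but may answer "don't know". A toy sketch with a sign domain (the domain and operator definitions below are my own illustration, not the book's formalization):

```python
# Toy sign abstraction: every integer is abstracted to one of four
# values; "?" (top) means "any integer" - the price of computability.
NEG, ZERO, POS, TOP = "-", "0", "+", "?"

def alpha(n):
    # Abstraction function: map a concrete integer to its sign.
    return ZERO if n == 0 else (POS if n > 0 else NEG)

def abs_mul(a, b):
    # Abstract multiplication is exact on signs.
    if ZERO in (a, b):
        return ZERO
    if TOP in (a, b):
        return TOP
    return POS if a == b else NEG

def abs_add(a, b):
    # Abstract addition loses precision: (+) + (-) could be anything.
    if a == b or b == ZERO:
        return a
    if a == ZERO:
        return b
    return TOP  # sound, but imprecise
```

For example, the analysis proves that a negative times a positive is negative, but for `-3 + 5` it can only answer `?`, even though the concrete result is positive: soundness is kept, precision is traded away.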

(updated on 05/10/2022)