In this article we will understand static code analysis and how it is done programmatically, using a very simple example.
What is Static Code Analysis?
- Static – Code that is not running.
- Code Analysis – Analyzing code towards some goal, like finding possible problems (as PMD does) or checking whether code follows certain standards (as Checkstyle does).
So basically we take source code and analyze it programmatically without actually running the code / executing the program. The end goal is to get some kind of result which tells us whether the code/program is written in the manner we want. There are several static code analysis tools for Java, such as Checkstyle, PMD and SpotBugs.
How is static code analysis done?
The generally preferred way to perform static code analysis is lexical analysis. The end goal of lexical analysis is to read a source file and convert all of its syntax into some data structure (like a tree) so that our static code analysis program can easily go through the code using that data structure and look for anomalies.
Let's understand the terminology of lexical analysis using this program that we wish to analyze (not a complete program, but a small snippet to keep things simple).
if(i > 0){
    a = b + c;
}
- Lexer / Tokenizer
  - It is a program that reads source code (like an AnalyzeMe.java file) character by character.
  - It then attempts to find the meaning of those characters by dividing/grouping them using spaces, tabs, semicolons etc. and creates 'Tokens' or 'Lexemes' (meaningful sets of characters, i.e. words). For example: 'if', 'i', '0' and so on (see the toy tokenizer sketch after this terminology list).
- Lexemes / Tokens
  - Tokens are nothing but words or strings which have a certain meaning from the programming language's perspective. For example: the 'if' token is a keyword in Java, the '+' and '=' tokens are operators, the '0' token is a literal, and 'i', 'a', 'b', 'c' are identifiers or variables.
- Grammar
  - This is generally a special text file which specifies the set of rules the Lexer uses to create tokens. This is how the Lexer program knows which words are keywords, which are literals, etc.
  - Grammar files are generally available for various Java language versions depending on the library we use. The structure of the grammar file also depends on the library.
- Parser
  - The Parser takes the tokens or lexemes and converts them into an 'Abstract Syntax Tree', so that our static code analysis tool can easily walk through the syntax tree and try to find issues with the source code.
- Abstract Syntax Tree (AST)
  - The AST is nothing but a tree-structure representation of the source code, created from the tokens or lexemes.
  - Below is a possible AST structure of the AnalyzeMe.java snippet above.
- If-Condition
  - Compare '>'
    - Variable 'i'
    - Literal '0'
- If-body
  - Operator '=' to variable 'a'
    - Operator '+'
      - Variable 'b'
      - Variable 'c'
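To make the lexer and token ideas a bit more concrete, here is a toy, hand-rolled tokenizer for the snippet above. This is purely illustrative and not part of the article's project; a real lexer (like the one ANTLR generates later in this article) is driven by a grammar file rather than hard-coded regular expressions.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Toy tokenizer (hypothetical class, for illustration only). It only
 * understands the tiny snippet used above.
 */
public class ToyTokenizer {

    // Very small "grammar": one keyword, identifiers, number literals and single-character symbols.
    private static final Pattern TOKEN_PATTERN = Pattern
            .compile("\\s*(?:(if\\b)|([A-Za-z_][A-Za-z0-9_]*)|(\\d+)|([><=+;(){}]))");

    public static void main(String[] args) {
        String source = "if(i > 0){ a = b + c; }";
        List<String> tokens = new ArrayList<>();

        Matcher matcher = TOKEN_PATTERN.matcher(source);
        while (matcher.find()) {
            if (matcher.group(1) != null) {
                tokens.add("KEYWORD    " + matcher.group(1));
            } else if (matcher.group(2) != null) {
                tokens.add("IDENTIFIER " + matcher.group(2));
            } else if (matcher.group(3) != null) {
                tokens.add("LITERAL    " + matcher.group(3));
            } else {
                tokens.add("SYMBOL     " + matcher.group(4));
            }
        }

        // Prints tokens like: KEYWORD if, SYMBOL (, IDENTIFIER i, SYMBOL >, LITERAL 0, ...
        tokens.forEach(System.out::println);
    }
}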
Libraries for lexical analysis
There are several libraries which provide easy ways to perform lexical analysis and AST preparation, for example ANTLR and JavaCC.
Let's do Static Code Analysis
We will use ANTLR (the ANTLR4 Maven dependency) in our example of performing static code analysis. Checkstyle also uses ANTLR. Checkstyle is a tool which verifies that source code adheres to specific coding standards.
Let's create a very simple example, similar to what Checkstyle does, using ANTLR.
Example:
For this example, we will perform a check to make sure that if a class is abstract, then the class name should start with 'Abstract'. Otherwise, if the class is not abstract, the class name should not start with 'Abstract'. This is very similar to a check in Checkstyle's own code.
Steps:
- Create or download a grammar file for Java. You can get it from the ANTLR repository (Java9.g4).
- Create a simple Maven project with the dependency antlr4-runtime. Also add the antlr4-maven-plugin, which converts the above grammar file into Java classes like the lexer, parser and listener. These are the classes we will use to prepare the AST.
- Write a program that uses the above generated classes to prepare the abstract syntax tree of the source code to analyze.
- Then traverse (or walk) the AST and perform the abstract class name check on it.
Here is the project structure
Note that the grammar file must be under "src/main/antlr4" because this is the default location used by the Maven plugin. Under this directory, you can have whatever package structure you wish for the generated classes. As per the structure above, the classes from Java9.g4 will be generated under "target/generated-sources/antlr4" with the package "com.itsallbinary.grammer.java9".
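For reference, the relevant part of the project layout would look roughly like this (based on the locations described above; adjust the package path to your own):

static-code-analysis/
├── pom.xml
└── src/
    └── main/
        └── antlr4/
            └── com/itsallbinary/grammer/java9/
                └── Java9.g4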
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.itsallbinary</groupId>
    <artifactId>static-code-analysis</artifactId>
    <version>0.0.1-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
    </properties>

    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.antlr/antlr4-runtime -->
        <dependency>
            <groupId>org.antlr</groupId>
            <artifactId>antlr4-runtime</artifactId>
            <version>4.7.2</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.antlr</groupId>
                <artifactId>antlr4-maven-plugin</artifactId>
                <version>4.7.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>antlr4</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
You can run mvn clean package so that the Maven plugin generates classes from the grammar file. Here are the generated classes.
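Typically, the generated class names follow the grammar name; with Java9.g4 and the package above, you should see files roughly like these under the target directory (the exact set depends on the ANTLR version and plugin options):

target/generated-sources/antlr4/com/itsallbinary/grammer/java9/
├── Java9Lexer.java
├── Java9Parser.java
├── Java9Listener.java
└── Java9BaseListener.java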
Here is the code that performs the actual check. Inline comments have been added for better understanding.
package com.itsallbinary.staticanalysis;

import java.util.regex.Pattern;

import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CodePointCharStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

import com.itsallbinary.grammer.java9.Java9BaseListener;
import com.itsallbinary.grammer.java9.Java9Lexer;
import com.itsallbinary.grammer.java9.Java9Parser;
import com.itsallbinary.grammer.java9.Java9Parser.ClassDeclarationContext;

public class AntlrExamples {

    public static void main(String[] args) {

        // Class code to perform static code analysis on. You may read it from ".java"
        // files.
        String[] classes = new String[] { "public abstract class AbstractPerson { void work(){} }",
                "public class AbstractPerson { void work(){} }",
                "public abstract class JustPerson { void work(){} }" };

        // Iterate on class codes.
        for (String classCode : classes) {

            // Prepare lexer & parser from code.
            CodePointCharStream inputStream = CharStreams.fromString(classCode);
            Java9Lexer java9Lexer = new Java9Lexer(inputStream);
            CommonTokenStream commonTokenStream = new CommonTokenStream(java9Lexer);
            Java9Parser java9Parser = new Java9Parser(commonTokenStream);

            // Prepare parse tree or AST (Abstract Syntax Tree)
            ParseTree tree = java9Parser.compilationUnit();

            // Register listener class which will perform checks.
            ClassNameOfAbstractClassListener listener = new ClassNameOfAbstractClassListener();
            ParseTreeWalker walker = new ParseTreeWalker();

            // Walk method will walk through all tokens & call appropriate listener methods
            // where we will perform checks.
            walker.walk(listener, tree);
        }
    }
}

/**
 * Listener class which verifies that an abstract class must have a name
 * starting with Abstract. Errors will be printed to the console.
 */
class ClassNameOfAbstractClassListener extends Java9BaseListener {

    // Walk method will call this method when it comes across a token which is a
    // class declaration.
    @Override
    public void enterClassDeclaration(ClassDeclarationContext ctx) {

        // Fetch class name from the token.
        String className = ctx.normalClassDeclaration().identifier().getText();

        // Check if the class from this token is abstract.
        boolean isAbstract = ctx.normalClassDeclaration().classModifier().stream()
                .anyMatch(m -> m.ABSTRACT() != null);

        // Regex pattern to verify name starts with the word Abstract.
        Pattern format = Pattern.compile("^Abstract.+$");
        boolean matching = format.matcher(className).find();

        System.out.println("\nClass Name = " + className + " | Is Abstract? = " + isAbstract);

        // Perform checks & print result in console.
        if (isAbstract && !matching) {
            System.out.println("Problem: Abstract class but name does not start with 'Abstract'");
        } else if (!isAbstract && matching) {
            System.out.println("Problem: Not an Abstract class but name starts with 'Abstract'");
        } else {
            System.out.println("Class declaration is all good.");
        }
    }
}
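The example above analyzes hard-coded strings. To run the same check against real .java files on disk, a sketch along these lines should work. The class name AnalyzeSourceDirectory and the src/main/java path are illustrative only; it assumes the generated Java9* classes and the ClassNameOfAbstractClassListener above, placed in the same package.

// Sketch only: feed real ".java" files through the same lexer -> parser -> listener pipeline.
// (Place this class in the same package as ClassNameOfAbstractClassListener above.)
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

import com.itsallbinary.grammer.java9.Java9Lexer;
import com.itsallbinary.grammer.java9.Java9Parser;

public class AnalyzeSourceDirectory {

    public static void main(String[] args) throws IOException {
        Path sourceRoot = Paths.get("src/main/java"); // directory to analyze (adjust as needed)

        try (Stream<Path> paths = Files.walk(sourceRoot)) {
            paths.filter(p -> p.toString().endsWith(".java")).forEach(AnalyzeSourceDirectory::analyze);
        }
    }

    private static void analyze(Path javaFile) {
        try {
            // Same pipeline as above: characters -> tokens -> parse tree -> listener walk.
            CharStream input = CharStreams.fromPath(javaFile);
            Java9Lexer lexer = new Java9Lexer(input);
            Java9Parser parser = new Java9Parser(new CommonTokenStream(lexer));
            ParseTree tree = parser.compilationUnit();

            System.out.println("\nAnalyzing file: " + javaFile);
            new ParseTreeWalker().walk(new ClassNameOfAbstractClassListener(), tree);
        } catch (IOException e) {
            System.err.println("Could not read " + javaFile + ": " + e.getMessage());
        }
    }
}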
Here is the output of running the AntlrExamples code above. So we have successfully analyzed source code programmatically, in a static manner.
Class Name = AbstractPerson | Is Abstract? = true
Class declaration is all good.

Class Name = AbstractPerson | Is Abstract? = false
Problem: Not an Abstract class but name starts with 'Abstract'

Class Name = JustPerson | Is Abstract? = true
Problem: Abstract class but name does not start with 'Abstract'
You can find the complete source code for this article in the GitHub repository.
PMD
PMD is another famous static code analysis tool, which performs checks to find coding flaws like unnecessary object creation, empty catch blocks etc. PMD uses the JavaCC library to walk through code syntax and perform static code analysis.
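To get a feel for the kind of flaw PMD reports, consider a snippet like the one below. An empty catch block such as this is typically flagged by a PMD rule (the class and file names here are made up purely for illustration).

import java.io.FileInputStream;
import java.io.IOException;

public class EmptyCatchExample {

    public void readConfig() {
        try (FileInputStream in = new FileInputStream("config.properties")) {
            in.read();
        } catch (IOException e) {
            // Empty catch block: the exception is silently swallowed.
            // A static analysis rule would flag this as a probable coding flaw.
        }
    }
}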
SpotBugs (formerly FindBugs)
FindBugs, or SpotBugs, is another static code analysis tool, which works directly with the compiled bytecode of classes instead of the non-compiled source code that PMD works with. It uses Apache Commons BCEL and ASM, which provide easy ways to perform Java bytecode analysis; this is different from the lexical analysis that we saw in this article.
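For contrast with the lexical analysis approach used above, bytecode-level inspection with Apache Commons BCEL looks roughly like the sketch below. This is a minimal sketch only; the class file path is a made-up example, and a real detector would go further and inspect the actual bytecode instructions of each method.

import java.io.IOException;

import org.apache.bcel.classfile.ClassParser;
import org.apache.bcel.classfile.JavaClass;
import org.apache.bcel.classfile.Method;

public class BytecodeInspectionExample {

    public static void main(String[] args) throws IOException {
        // Parse the compiled bytecode (.class file), not the .java source file.
        JavaClass javaClass = new ClassParser("target/classes/com/example/Person.class").parse();

        System.out.println("Class: " + javaClass.getClassName() + " | abstract? " + javaClass.isAbstract());

        // Walk the methods found in the bytecode; a real tool would analyze their instructions.
        for (Method method : javaClass.getMethods()) {
            System.out.println("Method: " + method.getName() + method.getSignature());
        }
    }
}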
Refer to the article below for an understanding of bytecode-level static code analysis, with a very simple example similar to FindBugs.
Do your own static code analysis programmatically in Java | Similar to FindBugs using Apache BCEL