Compare files side by side and highlight diff using Java | Apache Commons Text diff | Myers algorithm

In this article we will create a simple file diff program using the Apache Commons Text library & output the diff as HTML.

Example in this article:

  • Take two text files as input, i.e. file-1.txt & file-2.txt.
  • Compare both files using Apache Commons Text & generate HTML output highlighting the differences between them.
  • Use a plain Java main program to keep things simple.

Basic concept:

The org.apache.commons.text.diff package of the Apache Commons Text library is based on a “very efficient algorithm from Eugene W. Myers”. The algorithm compares the ‘left’ & ‘right’ strings character by character and produces a script of edit commands. We provide a visitor object, which is called once per command to report one of three cases:

  • The character is present in both ‘left’ & ‘right’: referred to as a ‘KeepCommand’.
  • The character is present in ‘left’ but not in ‘right’, so it needs to be deleted from ‘left’ to match ‘right’: referred to as a ‘DeleteCommand’.
  • The character is not present in ‘left’ but is present in ‘right’, so it needs to be inserted into ‘left’ to match ‘right’: referred to as an ‘InsertCommand’.

Library dependency (Maven, Gradle, etc.):

Add the Apache Commons Text library (groupId org.apache.commons, artifactId commons-text) as a dependency of your project.
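For Maven, the dependency entry could look like the fragment below. The version shown is only an example of a published release and is an assumption here; check for the latest release before using it.

```xml
<!-- Goes inside the <dependencies> section of pom.xml.
     The version is an example; pick the current release. -->
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-text</artifactId>
    <version>1.10.0</version>
</dependency>
```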

Here is the simplest code to diff two strings using org.apache.commons.text.diff.StringsComparator with a very simple CommandVisitor. The program uses brackets to highlight the differences between the two strings.
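A minimal sketch of such a program could look like this. The class name SimpleStringDiff, the nested BracketVisitor and the choice of ( ) for inserted and { } for deleted characters are assumptions made for illustration; only StringsComparator, getScript(), visit() and the three CommandVisitor methods come from the library.

```java
import org.apache.commons.text.diff.CommandVisitor;
import org.apache.commons.text.diff.StringsComparator;

class SimpleStringDiff {

    /** Collects the edit script into one string, marking changes with brackets. */
    static class BracketVisitor implements CommandVisitor<Character> {
        private final StringBuilder out = new StringBuilder();

        @Override
        public void visitKeepCommand(Character c) {
            // Character is present in both 'left' and 'right'.
            out.append(c);
        }

        @Override
        public void visitInsertCommand(Character c) {
            // Character is only in 'right': it would be inserted into 'left'.
            out.append('(').append(c).append(')');
        }

        @Override
        public void visitDeleteCommand(Character c) {
            // Character is only in 'left': it would be deleted from 'left'.
            out.append('{').append(c).append('}');
        }

        String result() {
            return out.toString();
        }
    }

    /** Runs the comparison and returns the bracket-annotated diff. */
    static String diff(String left, String right) {
        BracketVisitor visitor = new BracketVisitor();
        new StringsComparator(left, right).getScript().visit(visitor);
        return visitor.result();
    }

    public static void main(String[] args) {
        System.out.println(diff("ABCFGH", "BCDEFG"));
    }
}
```

Characters without brackets are common to both strings, so reading only the unbracketed and ( ) characters gives ‘right’, while reading the unbracketed and { } characters gives ‘left’.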



Let’s write code to compare files & generate an HTML diff

Now we will apply the same logic as above to compare two files & generate the diff in HTML format, with the differences properly highlighted.

For the output, we will use a very simple HTML template that shows the content of the two files side by side, left & right. It has the placeholders ${left} & ${right}, which our program fills in using String.replace(). (You may choose to use a proper template library such as Apache Velocity instead.)
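As a rough illustration of the placeholder approach, the sketch below keeps a tiny template in a string constant and fills it in with String.replace(). The markup, the CSS class names del and ins, and the DiffTemplate class are hypothetical stand-ins, not the template used by the article.

```java
class DiffTemplate {

    // Hypothetical two-column template; only the ${left} and ${right}
    // placeholder names are taken from the article.
    static final String TEMPLATE =
            "<html><head><style>"
          + ".del{background:#fbb} .ins{background:#bfb} td{vertical-align:top}"
          + "</style></head><body><table><tr>"
          + "<td><pre>${left}</pre></td>"
          + "<td><pre>${right}</pre></td>"
          + "</tr></table></body></html>";

    /** Substitutes the two HTML fragments into the template. */
    static String render(String leftHtml, String rightHtml) {
        // String.replace() does literal (non-regex) replacement,
        // so "${left}" needs no escaping.
        return TEMPLATE.replace("${left}", leftHtml)
                       .replace("${right}", rightHtml);
    }

    public static void main(String[] args) {
        System.out.println(render(
                "kept <span class=\"del\">x</span>",
                "kept <span class=\"ins\">y</span>"));
    }
}
```

One caveat of chained replace() calls: if the left-hand content itself contained the literal text ${right}, the second call would replace it too. A template library avoids that kind of edge case.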

Here is the complete code for the diff program, with explanatory comments inline. It has:

  • A main program, which reads ‘file-1.txt’ & ‘file-2.txt’ from the root of the project (make sure these files are there) and uses Apache Commons Text to do the diff with a custom visitor.
  • A visitor class, which stores the character diff in HTML format using highlighting spans. It also provides a method that generates the final diff HTML.
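The two parts described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the article's listing: it pairs lines purely by their position in the file and does not implement any similarity threshold, and the names FileDiffSketch, HtmlDiffVisitor, diffLines and the CSS classes del and ins are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

import org.apache.commons.text.diff.CommandVisitor;
import org.apache.commons.text.diff.StringsComparator;

class FileDiffSketch {

    /** Builds the left and right HTML fragments while visiting the edit script. */
    static class HtmlDiffVisitor implements CommandVisitor<Character> {
        private final StringBuilder left = new StringBuilder();
        private final StringBuilder right = new StringBuilder();

        @Override
        public void visitKeepCommand(Character c) {
            // Unchanged character: copy to both sides without highlighting.
            left.append(escape(c));
            right.append(escape(c));
        }

        @Override
        public void visitDeleteCommand(Character c) {
            // Only in 'left': highlight it on the left side.
            left.append("<span class=\"del\">").append(escape(c)).append("</span>");
        }

        @Override
        public void visitInsertCommand(Character c) {
            // Only in 'right': highlight it on the right side.
            right.append("<span class=\"ins\">").append(escape(c)).append("</span>");
        }

        void endLine() {
            left.append('\n');
            right.append('\n');
        }

        String leftHtml() { return left.toString(); }
        String rightHtml() { return right.toString(); }
    }

    /** Escapes the characters that would otherwise be read as HTML markup. */
    static String escape(char c) {
        switch (c) {
            case '<': return "&lt;";
            case '>': return "&gt;";
            case '&': return "&amp;";
            default:  return String.valueOf(c);
        }
    }

    /** Diffs line i of 'a' against line i of 'b'; returns {leftHtml, rightHtml}. */
    static String[] diffLines(List<String> a, List<String> b) {
        HtmlDiffVisitor visitor = new HtmlDiffVisitor();
        int lines = Math.max(a.size(), b.size());
        for (int i = 0; i < lines; i++) {
            String l = i < a.size() ? a.get(i) : "";   // missing line = empty
            String r = i < b.size() ? b.get(i) : "";
            new StringsComparator(l, r).getScript().visit(visitor);
            visitor.endLine();
        }
        return new String[] { visitor.leftHtml(), visitor.rightHtml() };
    }

    public static void main(String[] args) throws IOException {
        Path first = Paths.get("file-1.txt");
        Path second = Paths.get("file-2.txt");
        if (!Files.exists(first) || !Files.exists(second)) {
            System.err.println("Expected file-1.txt and file-2.txt in the working directory");
            return;
        }
        String[] sides = diffLines(
                Files.readAllLines(first, StandardCharsets.UTF_8),
                Files.readAllLines(second, StandardCharsets.UTF_8));
        // Inline stand-in for the HTML template with its two placeholders.
        String html = "<html><head><style>.del{background:#fbb} .ins{background:#bfb}"
                + " td{vertical-align:top}</style></head><body><table><tr>"
                + "<td><pre>${left}</pre></td><td><pre>${right}</pre></td>"
                + "</tr></table></body></html>";
        html = html.replace("${left}", sides[0]).replace("${right}", sides[1]);
        Files.write(Paths.get("finalDiff.html"), html.getBytes(StandardCharsets.UTF_8));
    }
}
```

Escaping <, > and & before appending matters here, because the file content ends up inside HTML and would otherwise be interpreted as markup.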





Time to execute the file diff

We will test the program using two test files placed in the project root. Some of their lines share at least 40% commonality with the corresponding line in the other file, and some do not.

With these files in place, we execute the program.

This will generate ‘finalDiff.html’ in the project root. Open this file in a browser to see the two files side by side with the differences highlighted.




Further improvements:

You can take this & improve it further on your own to build a better diff tool. Here are a few ideas to get you started.

  • Currently, if two lines don’t have 40% commonality, we simply show them on separate lines. You could improve this by trying to match such a line against the following lines & aligning it with the best match instead.
  • Currently, HTML highlighting is done per character. Enhance it to group consecutive inserted or deleted characters into a single span.
  • Make the program more efficient for larger files & try different output formats.
  • Convert the main program into a UI-oriented tool with a fancy look and feel. Enjoy!
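For the second idea in the list above (grouping consecutive inserted or deleted characters into a single span), one possible approach is to track whether a highlighting span is currently open and only emit tags when the state changes. The GroupedSpanBuffer class below is a hypothetical helper written for illustration; a visitor would keep one instance per side and call append(c, true) for changed characters and append(c, false) for kept ones.

```java
class GroupedSpanBuffer {
    private final StringBuilder html = new StringBuilder();
    private final String cssClass;
    private boolean spanOpen = false;

    GroupedSpanBuffer(String cssClass) {
        this.cssClass = cssClass;
    }

    /** Appends one character; 'changed' is true for inserted or deleted characters. */
    void append(char c, boolean changed) {
        if (changed && !spanOpen) {
            // First changed character of a run: open one span for the whole run.
            html.append("<span class=\"").append(cssClass).append("\">");
            spanOpen = true;
        } else if (!changed && spanOpen) {
            // Run of changed characters has ended: close the span.
            html.append("</span>");
            spanOpen = false;
        }
        html.append(escape(c));
    }

    /** Closes any span that is still open and returns the HTML. */
    String finish() {
        if (spanOpen) {
            html.append("</span>");
            spanOpen = false;
        }
        return html.toString();
    }

    private static String escape(char c) {
        switch (c) {
            case '<': return "&lt;";
            case '>': return "&gt;";
            case '&': return "&amp;";
            default:  return String.valueOf(c);
        }
    }

    public static void main(String[] args) {
        GroupedSpanBuffer left = new GroupedSpanBuffer("del");
        for (char c : "ab".toCharArray()) left.append(c, false);
        for (char c : "cd".toCharArray()) left.append(c, true);
        left.append('e', false);
        System.out.println(left.finish()); // ab<span class="del">cd</span>e
    }
}
```

Compared with one span per character, this keeps the generated HTML much smaller when long runs of text differ.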



 
