Syntax-Highlighted Diffs using Pygments and difflib
For the last few months I have been working on a homework management system called csman. I don’t really know why we call it csman, and I personally think it’s kind of a goofy name, but it was meant to replace an old website called cs1man, so I guess the name just carried over.
The new csman website is implemented in Python, using the Django framework and running on top of a Postgres database server. So far, it has been an absolute delight to work on this stack; it has been very smooth in every way.
One unique thing about the courses that we manage with csman is that they allow students to rework their assignments, so you frequently tell the students, “Here are nine things you need to fix; resubmit when you’re done.” And then they will edit their files and re-upload these files back into the system. To grade the redo, all you really want is a visual diff of the two files, the old one and the new one, to see if the students fixed everything you told them to fix. This is where a nice syntax-highlighted diff page would be incredibly helpful.
I learned about Pygments on the djangosnippets.org website (see the FAQ!), and I quickly found that the praise was completely deserved. It took only a few hours to drop Pygments into my project and get syntax-highlighted source code on the website. Next I wanted diffs between two source files, ideally also with syntax highlighting, and this was the next challenge. I found a hint of how to do it on this page here, but it wasn’t exactly what I was looking for, which was code that generates actual HTML output for the two files being diffed. However, it did mention this cool Python library called difflib, which computes diffs of any pair of sequences (as long as their elements are hashable). Nifty!
Given this info, it turned out to be very straightforward to figure out! Both Pygments and difflib are written for simplicity and ease of use, so it was very easy to combine the two. My solution was to do the following:
- Generate lines of HTML-formatted code for both input files. The result is two lists of strings, each string being a line of HTML.
- Feed these lists of strings to difflib, to generate diff-output.
- Iterate over the diff-output to generate HTML displaying both files, with highlighting of the variations.
The first step, generating a list of HTML-formatted lines for each source-file, is very easy using Pygments:
import pygments.util
from pygments.formatters import HtmlFormatter
from pygments.lexers import TextLexer, guess_lexer_for_filename

def get_formatted_lines(file_obj):
    '''
    This function takes either a SubmittedFile or GradedFile object, and returns
    a list of HTML-formatted lines from the input file's data.
    '''
    file_data = str(file_obj.file_data)

    # Guess the lexer to use, since submissions could be in any of a dozen or so
    # different programming languages that we teach. If we can't identify it,
    # use the plain ol' TextLexer.
    try:
        lexer = guess_lexer_for_filename(file_obj.file_name, file_data)
    except pygments.util.ClassNotFound:
        lexer = TextLexer()

    # Generate a list of HTML-formatted lines from the source file, and
    # return that list! _format_lines() yields (flag, line) pairs; we only
    # need the line.
    formatter = HtmlFormatter()
    formatted_lines = [t for (_, t) in formatter._format_lines(lexer.get_tokens(file_data))]
    return formatted_lines
(The SubmittedFile and GradedFile objects are simply the Django models I have created to represent individual files uploaded for a student’s submission.)
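The key function above is HtmlFormatter._format_lines(); this little gem generates exactly the results that we need. Pygments provides it so that users can subclass HtmlFormatter, but we just use it directly to get our formatted results. Note the leading underscore, which by Python convention marks the method as internal, so it is worth re-checking this call when upgrading Pygments.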
Next, we feed our results into Python’s difflib:
import difflib

def view_file_diff(request, file1_spec, file2_spec):
    file1 = get_file_or_404(file1_spec)
    file2 = get_file_or_404(file2_spec)

    formatted_lines1 = get_formatted_lines(file1)
    formatted_lines2 = get_formatted_lines(file2)

    matcher = difflib.SequenceMatcher(None, formatted_lines1, formatted_lines2)

    # This list goes to the Django template. It contains a list of
    # dictionaries that contain the actual data to display.
    snippets = []
    for (tag, i1, i2, j1, j2) in matcher.get_opcodes():
        if tag == 'equal':
            # Both sequences have this set of lines.
            snippets.append({'tag': tag,
                'file1_linenums': '\n'.join([str(n+1) for n in range(i1, i2)]),
                'file1_code': ''.join(formatted_lines1[i1:i2]),
                'file2_linenums': '\n'.join([str(n+1) for n in range(j1, j2)]),
                'file2_code': ''.join(formatted_lines2[j1:j2])})
        elif tag == 'delete':
            # Only the left sequence has this set of lines.
            snippets.append({'tag': tag,
                'file1_linenums': '\n'.join([str(n+1) for n in range(i1, i2)]),
                'file1_code': ''.join(formatted_lines1[i1:i2])})
        elif tag == 'insert':
            # Only the right sequence has this set of lines.
            snippets.append({'tag': tag,
                'file2_linenums': '\n'.join([str(n+1) for n in range(j1, j2)]),
                'file2_code': ''.join(formatted_lines2[j1:j2])})
        else:
            assert tag == 'replace'
            # The right and left sequences have conflicting sets of lines.
            snippets.append({'tag': tag,
                'file1_linenums': '\n'.join([str(n+1) for n in range(i1, i2)]),
                'file1_code': ''.join(formatted_lines1[i1:i2]),
                'file2_linenums': '\n'.join([str(n+1) for n in range(j1, j2)]),
                'file2_code': ''.join(formatted_lines2[j1:j2])})

    return render_helper(request, 'courses/view_file_diff.html',
                         {'file1': file1, 'file2': file2, 'snippets': snippets})
The actual diff-generation is very simple; it’s entirely contained within the matcher = difflib.SequenceMatcher(None, formatted_lines1, formatted_lines2) operation. Then, we just need to iterate over the regions of the file to process the results, which is what the for-loop does. For each chunk of the diff-output, we just use that information to figure out what to render.
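To make the shape of the diff-output concrete, here is a minimal sketch on two tiny lists. The sample lines are made up for illustration and simply stand in for the HTML-formatted lines. Each opcode is a tuple (tag, i1, i2, j1, j2) relating lines i1..i2 of the first list to lines j1..j2 of the second, and the ranges are half-open, just like Python slices.

```python
import difflib

# Made-up sample data standing in for the HTML-formatted lines.
old_lines = ["a\n", "b\n", "c\n"]
new_lines = ["a\n", "x\n", "c\n", "d\n"]

matcher = difflib.SequenceMatcher(None, old_lines, new_lines)
opcodes = matcher.get_opcodes()

# Print each region of the diff along with the lines it covers on each side.
for tag, i1, i2, j1, j2 in opcodes:
    print(tag, old_lines[i1:i2], new_lines[j1:j2])
```

For this input the tags come out as equal, replace, equal, insert, which are exactly the cases the for-loop in the view has to handle (plus delete, which does not occur here).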
For example, the diff output may say that both file1 and file2 are identical for lines 0..53. This means we need to get the HTML ready for these lines, and we also need to generate line-numbers for those regions. This is done with the grungy but straightforward instructions like this:
    'file1_linenums': '\n'.join([str(n+1) for n in range(i1, i2)]),
    'file1_code': ''.join(formatted_lines1[i1:i2]),
The str.join method joins together a list of strings, using the string it is called on as the separator. The first line above generates line-numbers for the specified range (note that difflib numbers lines from 0, but human beings typically number lines from 1), and the second line grabs the HTML-formatted lines in that region and concatenates them into one big chunk of HTML for that region. Simple!
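Since the same two expressions are repeated for each side of each region, they could be pulled into a small helper. The sketch below is one possible refactoring, not the code running in csman: render_side and build_snippets are hypothetical names, and it always fills in both sides of every snippet. That works because difflib reports an empty range (j1 == j2) on the right side of a 'delete' and an empty range (i1 == i2) on the left side of an 'insert', so the missing side simply becomes an empty string. It assumes the template treats an empty string the same as a missing key.

```python
import difflib


def render_side(formatted_lines, start, stop):
    """Return (line_numbers, code) for the half-open range [start, stop)."""
    # difflib counts from 0; people count from 1.
    linenums = '\n'.join(str(n + 1) for n in range(start, stop))
    code = ''.join(formatted_lines[start:stop])
    return linenums, code


def build_snippets(formatted_lines1, formatted_lines2):
    """Build the list of snippet dictionaries for the diff template."""
    matcher = difflib.SequenceMatcher(None, formatted_lines1, formatted_lines2)
    snippets = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        linenums1, code1 = render_side(formatted_lines1, i1, i2)
        linenums2, code2 = render_side(formatted_lines2, j1, j2)
        snippets.append({
            'tag': tag,
            'file1_linenums': linenums1,
            'file1_code': code1,
            'file2_linenums': linenums2,
            'file2_code': code2,
        })
    return snippets
```

With this helper, the body of the view would reduce to building the two lists of formatted lines and passing build_snippets(formatted_lines1, formatted_lines2) to the template.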
Finally, we need to generate HTML output for this result, and this is also very straightforward, given the contents of the snippets list we have built up:
<table class="highlighttable">
  <tr>
    {% with file1.submission as sub %}
    <th colspan="2"><tt>{{ file1.file_name }}</tt> ({{ sub.assignment.short_name }} by {{ sub.student.user.username }} on {{ sub.submit_date|date:"m/d/y h:iA" }})</th>
    {% endwith %}
    {% with file2.submission as sub %}
    <th colspan="2"><tt>{{ file2.file_name }}</tt> ({{ sub.assignment.short_name }} by {{ sub.student.user.username }} on {{ sub.submit_date|date:"m/d/y h:iA" }})</th>
    {% endwith %}
  </tr>
  {% for snip in snippets %}
  <tr>
    {% ifequal snip.tag "equal" %}
    <td class="linenos diff_equal"><div class="linenodiv"><pre>{{ snip.file1_linenums }}</pre></div></td>
    <td class="code diff_equal"><div class="highlight"><pre>{{ snip.file1_code|safe }}</pre></div></td>
    <td class="linenos diff_equal"><div class="linenodiv"><pre>{{ snip.file2_linenums }}</pre></div></td>
    <td class="code diff_equal"><div class="highlight"><pre>{{ snip.file2_code|safe }}</pre></div></td>
    {% endifequal %}
    {% ifequal snip.tag "delete" %}
    <td class="linenos diff_delete"><div class="linenodiv"><pre>{{ snip.file1_linenums }}</pre></div></td>
    <td class="code diff_delete"><div class="highlight"><pre>{{ snip.file1_code|safe }}</pre></div></td>
    <td class="diff_blank"><div class="linenodiv"><pre>{{ snip.file2_linenums }}</pre></div></td>
    <td class="diff_blank"><div class="highlight"><pre>{{ snip.file2_code|safe }}</pre></div></td>
    {% endifequal %}
    {% ifequal snip.tag "insert" %}
    <td class="diff_blank"><div class="linenodiv"><pre>{{ snip.file1_linenums }}</pre></div></td>
    <td class="diff_blank"><div class="highlight"><pre>{{ snip.file1_code|safe }}</pre></div></td>
    <td class="linenos diff_insert"><div class="linenodiv"><pre>{{ snip.file2_linenums }}</pre></div></td>
    <td class="code diff_insert"><div class="highlight"><pre>{{ snip.file2_code|safe }}</pre></div></td>
    {% endifequal %}
    {% ifequal snip.tag "replace" %}
    <td class="linenos diff_replace"><div class="linenodiv"><pre>{{ snip.file1_linenums }}</pre></div></td>
    <td class="code diff_replace"><div class="highlight"><pre>{{ snip.file1_code|safe }}</pre></div></td>
    <td class="linenos diff_replace"><div class="linenodiv"><pre>{{ snip.file2_linenums }}</pre></div></td>
    <td class="code diff_replace"><div class="highlight"><pre>{{ snip.file2_code|safe }}</pre></div></td>
    {% endifequal %}
  </tr>
  {% endfor %}
</table>
Most of the markup here just mirrors what Pygments expects around its output: the HtmlFormatter class generates lines that are meant to be wrapped in <pre> tags, and it also uses <div> tags (such as <div class="highlight">) to apply its styles to the results.
Of course, I have also added my own styles to indicate added/removed/changed lines, and the CSS for these is easy:
.highlighttable .diff_equal { border:2px solid white; }
.highlighttable .diff_delete { border:2px solid #ffcf7f; }
.highlighttable .diff_insert { border:2px solid #7fff7f; }
.highlighttable .diff_replace { border:2px solid #ff7f7f; }
.highlighttable .diff_blank { border:2px solid white; }
.highlighttable td { vertical-align:top; }
.highlighttable pre { margin:0em; padding:0em; }
.highlighttable .linenos {
    color: #7f7f7f;
    text-align: right;
    padding-left: 0.5em;
    padding-right: 0.5em;
}
You can see that I have also tweaked some of the other styles, for line numbers and a few of the HTML tags, to make sure that the diff-output all lines up properly.
And that’s it!
The one drawback of this approach is that the webpage is very wide, and not very easy to use as a diff tool. Clearly there needs to be better scrollbar functionality, but I think that would require some JavaScript and I’m not quite ready to do that yet. It would certainly make the above functionality that much more useful, though.
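One way to narrow the page without any JavaScript might be to stack the two files in a single column, the way a unified diff does. The sketch below is only an idea, not something csman does: single_column_rows is a hypothetical helper, and it reuses the diff_equal, diff_delete and diff_insert class names from the stylesheet above. It takes the same two lists of HTML-formatted lines and returns one row per line, with removed lines listed before added ones inside each changed region.

```python
import difflib


def single_column_rows(formatted_lines1, formatted_lines2):
    """Return a list of (css_class, marker, html_line) rows for a one-column diff."""
    matcher = difflib.SequenceMatcher(None, formatted_lines1, formatted_lines2)
    rows = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'equal':
            # Unchanged lines appear once.
            rows.extend(('diff_equal', ' ', line)
                        for line in formatted_lines1[i1:i2])
        else:
            # 'delete', 'insert' and 'replace': old lines first, then new ones.
            # One of the two slices is empty for 'insert' and 'delete'.
            rows.extend(('diff_delete', '-', line)
                        for line in formatted_lines1[i1:i2])
            rows.extend(('diff_insert', '+', line)
                        for line in formatted_lines2[j1:j2])
    return rows
```

The trade-off is that a single column loses the side-by-side alignment and the separate line numbers for each file, which may matter when checking a student’s fixes against the original.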