This article will look at using the “difflib” module in Python.
This module provides classes and functions for comparing sequences. It can be used, for example, to compare files, and it can produce information about file differences in various formats, including HTML and context and unified diffs.
Differ class
This is a class for comparing sequences of lines of text, and producing human-readable differences or deltas. Differ uses SequenceMatcher both to compare sequences of lines, and to compare sequences of characters within similar (near-matching) lines.
Each line of a Differ delta begins with a two-letter code:
Code | Meaning |
---|---|
'- ' | line unique to sequence 1 |
'+ ' | line unique to sequence 2 |
'  ' | line common to both sequences |
'? ' | line not present in either input sequence |
Let's see an example:
    # importing the difflib module
    import difflib

    # the strings
    string_1 = "This is the first string to check"
    string_2 = "This is the second string to check"

    # splitting each string into a list of lines with splitlines()
    lines_string1 = string_1.splitlines()
    lines_string2 = string_2.splitlines()

    # creating a Differ object and calling its compare() method
    diff = difflib.Differ()
    my_diff = diff.compare(lines_string1, lines_string2)

    # printing the results
    print("First String:", string_1)
    print("Second String:", string_2)
    print("Difference between the Strings")
    print('\n'.join(my_diff))
This displayed the following:
    >>> %Run difflibediffer.py
    First String: This is the first string to check
    Second String: This is the second string to check
    Difference between the Strings
    - This is the first string to check
    ?             --- ^

    + This is the second string to check
    ?              ^^^^^
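The example above compares two single-line strings, so only the '- ', '+ ' and '? ' codes show up. As a rough sketch, the snippet below compares two short multi-line texts so that the '  ' (common line) code appears as well. The sample texts and the `by_code` grouping dictionary are made up for illustration and are not part of difflib.

```python
from difflib import Differ

# Two small multi-line texts (invented for this sketch)
text_1 = "apple\nbanana\ncherry\n"
text_2 = "apple\nbananas\ndate\n"

# keepends=True keeps the trailing newline on every line,
# which is the form Differ.compare() works with
lines_1 = text_1.splitlines(keepends=True)
lines_2 = text_2.splitlines(keepends=True)

# compare() returns a generator, so collect it into a list
delta = list(Differ().compare(lines_1, lines_2))

# Group the delta lines by their two-character code
by_code = {}
for line in delta:
    by_code.setdefault(line[:2], []).append(line[2:].rstrip("\n"))

# Print the delta itself, then the codes that occurred
print("".join(delta), end="")
print("Codes seen:", sorted(by_code))
```

Because the lines already end in a newline, the delta can be printed with `"".join()` rather than `'\n'.join()`.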
get_close_matches function
get_close_matches(word, possibilities, n=3, cutoff=0.6)
Return a list of the best “good enough” matches. word is a sequence for which close matches are desired (typically a string), and possibilities is a list of sequences against which to match word (typically a list of strings).
Optional argument n (default 3) is the maximum number of close matches to return; n must be greater than 0.
Optional argument cutoff (default 0.6) is a float in the range [0, 1]. Possibilities that don’t score at least that similar to word are ignored.
Let's look at an example:
    from difflib import get_close_matches

    # the candidate words to match against
    words = ['master', 'mask', 'basking', 'task', 'mass', 'massive', 'miss', 'mess']

    # the same search, returning at most 1, 2 and 3 matches
    my_list1 = get_close_matches('mas', words, n=1, cutoff=0.3)
    my_list2 = get_close_matches('mas', words, n=2, cutoff=0.3)
    my_list3 = get_close_matches('mas', words, n=3, cutoff=0.3)

    print("Matching words:", my_list1)
    print("Matching words:", my_list2)
    print("Matching words:", my_list3)
This displayed the following:
    >>> %Run diffligclosematches.py
    Matching words: ['mass']
    Matching words: ['mass', 'mask']
    Matching words: ['mass', 'mask', 'master']
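The example above only varies n. To show what cutoff does, here is a minimal sketch that reuses the same word list and changes the cutoff instead. The variable names are illustrative only.

```python
from difflib import get_close_matches

words = ['master', 'mask', 'basking', 'task', 'mass', 'massive', 'miss', 'mess']

# Defaults: n=3, cutoff=0.6
default_matches = get_close_matches('mas', words)

# A stricter cutoff keeps only the candidates that are very similar to 'mas'
strict_matches = get_close_matches('mas', words, cutoff=0.8)

# A cutoff of 0.0 filters nothing out, so n alone limits the result
all_matches = get_close_matches('mas', words, n=len(words), cutoff=0.0)

print("Default:", default_matches)       # ['mass', 'mask', 'master']
print("Strict:", strict_matches)         # ['mass', 'mask']
print("No cutoff:", len(all_matches), "matches")
```

Raising the cutoff shortens the list even though n still allows three results, while a cutoff of 0.0 returns every candidate, best match first.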
SequenceMatcher class
The SequenceMatcher class compares two provided strings and returns data representing the similarity between them.
You can use the ratio() method to return a measure of the sequences’ similarity as a float in the range [0, 1].
Let's look at an example:
    # importing SequenceMatcher from the difflib library
    from difflib import SequenceMatcher

    # strings
    string_1 = "This is the first string to check"
    string_2 = "This is the second string to check"

    # creating a SequenceMatcher object for the two strings
    my_sequence = SequenceMatcher(a=string_1, b=string_2)

    # printing the result
    print("First String:", string_1)
    print("Second String:", string_2)
    print("Sequence Matched:", my_sequence.ratio())
This displayed the following:
    >>> %Run difflibsequence.py
    First String: This is the first string to check
    Second String: This is the second string to check
    Sequence Matched: 0.8656716417910447
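To see where a value like 0.8656716417910447 comes from: the Python documentation describes the ratio as 2.0*M / T, where T is the total number of elements in both sequences and M is the number of matches. The sketch below recomputes it by hand for the same two strings, using the get_matching_blocks() method of SequenceMatcher. The `matches`, `total` and `manual_ratio` names are just for illustration.

```python
from difflib import SequenceMatcher

string_1 = "This is the first string to check"
string_2 = "This is the second string to check"

matcher = SequenceMatcher(a=string_1, b=string_2)

# M: number of matching characters, summed over all matching blocks
# (the final block is a zero-length sentinel, so it adds nothing)
matches = sum(block.size for block in matcher.get_matching_blocks())

# T: total number of characters in both strings
total = len(string_1) + len(string_2)

# The formula given in the difflib documentation
manual_ratio = 2.0 * matches / total

print("Matching characters (M):", matches)   # 29
print("Total characters (T):", total)        # 67
print("Manual ratio:", manual_ratio)
print("ratio():", matcher.ratio())
```

Both printed ratios should be the same number, since 2 * 29 / 67 is about 0.8657.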
unified_diff function
difflib.unified_diff(a, b, fromfile='', tofile='', fromfiledate='', tofiledate='', n=3, lineterm='\n')
Compare a and b (lists of strings); return a delta (a generator generating the delta lines) in unified diff format.
Unified diffs are a compact way of showing just the lines that have changed plus a few lines of context.
The changes are shown in an inline style. The number of context lines is set by n, which defaults to three.
Let's look at an example:
    # importing the required modules
    import sys
    from difflib import unified_diff

    # defining the lists of lines to compare
    string1 = ['C++\n', 'Java\n', 'Python\n', 'Javascript\n', 'HTML\n', 'Programming\n']
    string2 = ['Python\n', 'Lua\n', 'Perl\n', 'Go\n', 'Rust\n', 'Programming\n']

    # using the unified_diff() function
    sys.stdout.writelines(unified_diff(string1, string2))
This displayed the following:
    >>> %Run difflibunified.py
    ---
    +++
    @@ -1,6 +1,6 @@
    -C++
    -Java
     Python
    -Javascript
    -HTML
    +Lua
    +Perl
    +Go
    +Rust
     Programming
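The "---" and "+++" header lines in that output are empty because fromfile and tofile were left at their defaults. As a small sketch of the optional arguments, the snippet below passes both names and sets n=0 to drop the context lines. The file names are hypothetical labels only (no files are read), and the list names are chosen for this sketch.

```python
import sys
from difflib import unified_diff

old_lines = ['C++\n', 'Java\n', 'Python\n', 'Javascript\n', 'HTML\n', 'Programming\n']
new_lines = ['Python\n', 'Lua\n', 'Perl\n', 'Go\n', 'Rust\n', 'Programming\n']

# fromfile/tofile label the "---" and "+++" header lines;
# n=0 asks for no context lines around each change
delta = list(unified_diff(old_lines, new_lines,
                          fromfile='old_languages.txt',
                          tofile='new_languages.txt',
                          n=0))

sys.stdout.writelines(delta)
```

With n=0 the unchanged lines ('Python' and 'Programming') no longer appear, and the changes are split into two separate "@@" hunks instead of one.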
Links
https://docs.python.org/3/library/difflib.html