Free Courses | Free Gallery | Free Library | Free FAQS | Free Links | Free Speech |
|
Easy Diff: Another file comparison program that uses a unique algorithm |
Sample output:
t1.txt | t2.txt |
a |
a |
b c d e f g h i |
a |
j k l |
j k l |
m n o p |
j k l |
Algorithm:
1.
Take note of the lines in both inputs that are equivalent. That would
need n x m comparisons assuming that the first input is n lines long
and the other is m lines long. Say that our container for this is H
(for hits). H would contain pairs (x, y) such that line x + 1 in the
first input is the same as line y + 1 in the second input.
Denote as
H[N].x the first component of the nth element of H. Same way, H[N].y
the second component.
For any two elements of H, say H[i], H[j]
where
i < j, H[i].x <= H[j].x and if H[i].x = H[j].x, then H[i].y
< H[j].y.
2. Construct an integer
array of length (or number of elements) equal to the number of same
lines in the first step (that is, the length of the array is equal to
the number of elements in H). Let us name this array as A. Each
element of this array would be initialized to 1. A[0] is the first
element and
A[LENGTH - 1] is the last. LENGTH is equal to the number
of elements in the array, which is also equal to the number of
elements in H.
Let us start with the
second to the last element of the array. This is denoted as A[LENGTH
- 2].
At present, A[LENGTH - 1], the last element is equal to 1.
A[LENGTH - 2] would be equal to
2 (A[LENGTH - 1] + 1) if
H[LENGTH
- 2] (the second to the last element of H) is not in conflict or is
not overlapping
with
H[LENGTH - 1].
(Two elements are in conflict
when either H[i].x == H[j].x or H[i].y >= H[j].y
for some i and j
such that i < j < LENGTH.)
If it (A[LENGTH - 2]) is in
conflict (with A[LENGTH - 1]), then its value is retained at 1.
Looking ahead of the algorithm, we will
be able to generate an M (for max) subset of H
such that M[i].x <
M[j].x and M[i].y < M[j].y for any i, j such that i < j <
LENGTH.
We proceed to derive A[LENGTH - 3], then A[LENGTH - 4] and so on until A[0] (in that sequence),
almost the same way. A[LENGTH - k] = max{A[LENGTH - k + p]} + 1
where
p < k <= LENGTH such that H[LENGTH - k] is not in conflict with
H[LENGTH - p]. (In simpler words, the value of A[LENGTH - k] is one
plus the maximum value of previous values
(A[LENGTH - k + p]) such
that H[LENGTH - k] is not in conflict with
H[LENGTH - p]).
3. The last part would just be
tracing the largest subsequence. We start at the index with the
biggest value, max{A[k]} for all
k < LENGTH (the smallest index or
leftmost element if it is not unique). Say this is A[a]. We next find
the leftmost or the first
A[a + s] where A[a] = A[a + s] - 1, and
H[a] and H[a + s] are not in conflict. We find next
A[a + s + t]
where A[a + s] = A[a + s + t] - 1 and so on until we find A[z] = 1.
Our largest subsequence
is therefore the lines of H[a], H[a + s],
H[a + s + t], ...
It is that easy!
(Fix me freely if I am found to be broken.)
Algorithm in
action:
Given this two text files as
input:
t1.txt |
t2.txt |
a |
a |
The first part would yield:
i H[i]
0
(0, 0) // this means that the first
lines of the input files are the same
1 (0, 1) //
this means that the first line in the first input file is the same as
the second line in the other input file and so on
2
(9, 2)
3 (9, 5)
4 (10, 3)
5
(10, 6)
6 (11, 4)
7 (11, 7)
Second part follows:
i A[i]
7
1
6 1 // (11, 7) is in conflict
with (11, 4)
5 2 // (10, 6) is in
conflict with (11, 4) but not with (11, 7)
4 2 //
a largest subsequence containing (10, 3) can also have one of (11, 4)
or (11, 7)
3 3 // (9, 5) is not in
conflict with (10, 6)
2 3 // (9, 2)
is not conflicting with (10, 3)
1 4 //
(0, 1) is not conflicting with either (9, 2) or (9, 5)
0
4 // (0, 0) is conflicting with (0, 1) but not with
either (9, 2) or (9, 5)
Third part follows:
i
A[i] H[i]
0 4
(0, 0) // (0, 0) is chosen because A[0] is largest
and because it came before (0, 1)
2 3
(9, 2) // (9, 5) can be chosen instead but the
algorithm is strict about getting the element with the
// smaller index or that which came first
4 2
(10, 3) // same reasoning why this was chosen over
(10, 6)
6 1 (11, 4) //
same reasoning why this was chosen over (11, 7)
And
we have the sample output above.
Java Implementation:
See
the source code please. The steps of the algorithm is marked as /*1*/
for the first step and so on.
To do:
-gui needed please help
Critical bug in previous versions:
bug in the html presentation, does not "clean"
the string. This program would break when an input file contains html
entities like < and >, quotes and others. I think I have fixed it already.
Copyright (c) Jonee Ryan Ty
Permission is granted to copy,
distribute, and/or modify this document under the terms of the GNU
Free Documentation License, Version 1.1 or any later version
published by the Free Software Foundation; with no Invariant
Sections, with no Front-Cover Texts, and with no Back-Cover Texts.
A copy of the license is available on the GNU site.
Contact the original author through vampire_janus@yahoo.com.
Download everything in here.