CookieCorner

Free Courses     Free Gallery     Free Library     Free FAQS     Free Links     Free Speech    



Easy Diff: Another file comparison program that uses a unique algorithm

Sample output:

Algorithm:
    1. Take note of the lines in both inputs that are equivalent. That would need n x m comparisons assuming that the first input is n lines long and the other is m lines long. Say that our container for this is H (for hits). H would contain pairs (x, y) such that line x + 1 in the first input is the same as line y + 1 in the second input.
Denote as H[N].x the first component of the nth element of H. Same way, H[N].y the second component.
For any two elements of H, say H[i], H[j]
where i < j, H[i].x <= H[j].x and if H[i].x = H[j].x, then H[i].y < H[j].y.
    2. Construct an integer array of length (or number of elements) equal to the number of same lines in the first step (that is, the length of the array is equal to the number of elements in H). Let us name this array as A. Each element of this array would be initialized to 1. A[0] is the first element and
A[LENGTH - 1] is the last. LENGTH is equal to the number of elements in the array, which is also equal to the number of elements in H.
    Let us start with the second to the last element of the array. This is denoted as A[LENGTH - 2].
At present, A[LENGTH - 1], the last element is equal to 1. A[LENGTH - 2] would be equal to
2 (A[LENGTH - 1] + 1) if
H[LENGTH - 2] (the second to the last element of H) is not in conflict or is not overlapping
with H[LENGTH - 1].
(Two elements are in conflict when either H[i].x == H[j].x or H[i].y >= H[j].y
for some i and j such that i < j < LENGTH.)
If it (A[LENGTH - 2]) is in conflict (with A[LENGTH - 1]), then its value is retained at 1.
    Looking ahead of the algorithm, we will be able to generate an M (for max) subset of H
such that M[i].x < M[j].x and M[i].y < M[j].y for any i, j such that i < j < LENGTH.
    We proceed to derive A[LENGTH - 3], then A[LENGTH - 4] and so on until A[0] (in that sequence), almost the same way. A[LENGTH - k] = max{A[LENGTH - k + p]} + 1
where p < k <= LENGTH such that H[LENGTH - k] is not in conflict with H[LENGTH - p]. (In simpler words, the value of A[LENGTH - k] is one plus the maximum value of previous values
(A[LENGTH - k + p]) such that H[LENGTH - k] is not in conflict with
H[LENGTH - p]).
    3. The last part would just be tracing the largest subsequence. We start at the index with the biggest value, max{A[k]} for all
k < LENGTH (the smallest index or leftmost element if it is not unique). Say this is A[a]. We next find the leftmost or the first
A[a + s] where A[a] = A[a + s] - 1, and H[a] and H[a + s] are not in conflict. We find next
A[a + s + t] where A[a + s] = A[a + s + t] - 1 and so on until we find A[z] = 1. Our largest subsequence
is therefore the lines of H[a], H[a + s], H[a + s + t], ...

    It is that easy! (Fix me freely if I am found to be broken.)

Algorithm in action:
    Given this two text files as input:


The first part would yield:
i     H[i]
0    (0, 0)    // this means that the first lines of the input files are the same
1    (0, 1)    // this means that the first line in the first input file is the same as the second line in the other input file and so on
2    (9, 2)
3    (9, 5)
4    (10, 3)
5    (10, 6)
6    (11, 4)
7    (11, 7)

Second part follows:
i    A[i]
7    1
6    1    // (11, 7) is in conflict with (11, 4)
5    2    // (10, 6) is in conflict with (11, 4) but not with (11, 7)
4    2    // a largest subsequence containing (10, 3) can also have one of (11, 4) or (11, 7)
3    3    // (9, 5) is not in conflict with (10, 6)
2    3    // (9, 2) is not conflicting with (10, 3)
1    4    // (0, 1) is not conflicting with either (9, 2) or (9, 5)
0    4    // (0, 0) is conflicting with (0, 1) but not with either (9, 2) or (9, 5)

Third part follows:
i    A[i]      H[i]
0    4      (0, 0)    // (0, 0) is chosen because A[0] is largest and because it came before (0, 1)
2    3      (9, 2)    // (9, 5) can be chosen instead but the algorithm is strict about getting the element with the
                                  // smaller index or that which came first
4    2      (10, 3)    // same reasoning why this was chosen over (10, 6)
6    1      (11, 4)    // same reasoning why this was chosen over (11, 7)

    And we have the sample output above.

Java Implementation:
    See the source code please. The steps of the algorithm is marked as /*1*/ for the first step and so on.

To do:
    -gui needed please help

Critical bug in previous versions:
    bug in the html presentation, does not "clean" the string. This program would break when an input file contains html entities like < and >, quotes and others. I think I have fixed it already.

Copyright (c) Jonee Ryan Ty
    Permission is granted to copy, distribute, and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with no Invariant Sections, with no Front-Cover Texts, and with no Back-Cover Texts. A copy of the license is available on the GNU site.

Contact the original author through vampire_janus@yahoo.com. Download everything in here.


home | main | top