Another project I've been working on is a file utility. Some time ago I was frustrated with files I was downloading because they had gaps of blank space where data should have been. I found, however, that on downloading the file again, sometimes just on a different day, I would either get the complete file or a version that had different gaps.

It occurred to me that when two files have the same length and the gaps appear in different places, I could either minimize the gaps or eliminate them entirely by performing a byte-by-byte comparison and filling in data from one with the other. However, I quickly discovered that no such tool exists (none that I could find, anyway). Sheesh!
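The byte-by-byte fill is simple enough to sketch. This is a minimal version, assuming the gaps are runs of NUL (0x00) bytes and both inputs are the same length; the function name is mine, not the actual tool's:

```python
def merge_gaps(data_a: bytes, data_b: bytes) -> bytes:
    """For each position, prefer whichever input has non-NUL data there."""
    if len(data_a) != len(data_b):
        raise ValueError("inputs must be the same length")
    # Assumes a gap byte is 0x00; a genuine data byte of 0x00 would be
    # indistinguishable from a gap with this scheme.
    return bytes(a if a != 0 else b for a, b in zip(data_a, data_b))
```

The caveat in the comment is the weak spot: real data can legitimately contain zero bytes, so a production version would want to track gap runs (e.g. long stretches of NULs) rather than treating any single zero as missing.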

Since it was a simple concept, I decided to build it myself. My early results were amazing; I was able to successfully combine several files I had in two different incomplete versions. After that I decided to expand on it and see if I could make it useful for more people.

One of my to-do items is to add support for merging more than two files at once. You can do successive merges right now, of course, but I'd like to broaden that so that if three different versions exist, you can merge them in one pass. I also need to add a feature that allows the files to be different lengths, simply appending the extra data from the longer file to the output. But wait, there's more!
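The multi-file case is a natural generalization of the pairwise idea: take the first non-gap byte seen at each position across all versions. A sketch under the same NUL-gap, equal-length assumptions as before (not the author's actual implementation):

```python
def merge_many(versions: list[bytes]) -> bytes:
    """Merge any number of equal-length gap-filled copies in one pass."""
    # zip(*versions) walks the files column by column; the first non-NUL
    # byte in each column wins, falling back to 0 if every copy has a gap.
    return bytes(next((v for v in column if v != 0), 0)
                 for column in zip(*versions))
```

Because the merge is associative, this produces the same result as folding the pairwise merge over the list, just without the intermediate copies.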

There are currently few or no binary diff tools that can tell you how two binary files differ. I'd like to develop something that can search for matching blocks in multiple files and figure out how they're arranged, then splice them together. Basically, my idea would analyze the files and label the blocks like so:

File 1: A - B - C - D - E
File 2: B - F - C - G - E
File 3: A - F - D - G

From this it would deduce several rules: A≺B, B≺C, C≺D, D≺E, B≺F, F≺C, C≺G, G≺E, A≺F, F≺D, D≺G. It would finally place them in this order:

A - B - F - C - D - G - E

There's potential for ambiguity: when two blocks are found whose relative order is not known, one will be written at random to the output and the other discarded. If a conflict appears, where one block precedes another that in turn precedes it, every block in that cycle is discarded.
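The arrangement step described above is a topological sort with two extra rules. Here's one way it could look (my sketch, not the completed algorithm the author mentions): Kahn-style processing where, among blocks that are simultaneously "ready" (so their mutual order is unknown), one is kept at random and the rest discarded, and any blocks left over at the end must sit in a cycle and are all dropped:

```python
import random

def order_blocks(blocks, rules):
    """Arrange blocks given rules (a, b) meaning a precedes b."""
    succ = {blk: set() for blk in blocks}   # successors of each block
    indeg = {blk: 0 for blk in blocks}      # number of known predecessors
    for a, b in rules:
        if b not in succ[a]:
            succ[a].add(b)
            indeg[b] += 1
    alive = set(blocks)
    output = []
    while alive:
        ready = [blk for blk in alive if indeg[blk] == 0]
        if not ready:
            break  # every remaining block sits in a cycle: discard them all
        keep = random.choice(ready)  # unknown order: keep one at random
        output.append(keep)
        for dropped in ready:        # ...and discard the other ready blocks
            alive.discard(dropped)
            for s in succ[dropped]:
                indeg[s] -= 1
    return output
```

On the example above the rules leave exactly one ready block at every step, so the result is deterministically A - B - F - C - D - G - E; the random tie-break only kicks in when the rules genuinely don't order two blocks.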

Right now the main thing I'm missing is the search algorithm to find blocks and label them. The algorithm to arrange them is already complete.
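One possible starting point for that search step is Python's difflib, which can already report runs of bytes two files share; the minimum block size here is an assumption, and real files would need tuning (and something faster, like a rolling hash, at scale):

```python
from difflib import SequenceMatcher

def shared_blocks(a: bytes, b: bytes, min_size: int = 4):
    """Return (offset_in_a, offset_in_b, length) for runs both files share."""
    sm = SequenceMatcher(None, a, b, autojunk=False)
    # get_matching_blocks ends with a zero-length sentinel, which the
    # size filter drops along with blocks too small to label.
    return [(m.a, m.b, m.size)
            for m in sm.get_matching_blocks()
            if m.size >= min_size]
```

Running this pairwise across the file versions would yield the labeled blocks; the precedence rules then fall out of each block's position within each file.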

Have you considered looking at the source for BitTorrent or other P2P programs? They basically do the same thing, intentionally.

Still, I can't believe a tool like this doesn't exist; it sounds like something someone would have hacked together in the old days of UNIX, when data backups weren't always that reliable.

Actually, P2P programs do something different with files. They work on a set of hashes and can only verify that a specific chunk at a specific location is correct. When they trade chunks, they also trade each chunk's location in the file.

The tool I've got currently can take two partial files from things like that and merge them, because they store their data as a full file with blank spots. (There are exceptions, however. Kazaa, for example, always kept its partial files in a .dat file containing a mix of data and metadata. Shareaza, which I use, keeps partial files plus .sd files with the metadata.) Newsgroup downloads, which inspired this, also leave such gaps when encoded with yEnc. yEnc keeps track of each chunk's size and relative position, so you can place each chunk at its correct position even if you don't have all of them, provided an .nzb file lists all the chunks.
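That positioned-chunk layout is easy to picture in code. A sketch of rebuilding a gap-filled file from chunks that each know their offset, the way a yEnc/.nzb download does; missing chunks simply leave NULs behind (the chunk representation here is illustrative, not any client's actual format):

```python
def assemble(total_size: int, chunks: list[tuple[int, bytes]]) -> bytes:
    """Place each (offset, data) chunk into a file of total_size bytes."""
    out = bytearray(total_size)  # unfilled regions stay as NUL gaps
    for offset, data in chunks:
        out[offset:offset + len(data)] = data
    return bytes(out)
```

Two such outputs, built from different subsets of the chunks, are exactly the kind of same-length gap-filled pair the merge tool can already combine.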

What I eventually want is a utility that can also reassemble a full file from several copies that have no gaps, only missing segments, with the remaining data spliced end to end. The data is in order, but each chunk is not necessarily at its correct position because of the missing pieces. This happens to UUencoded files when some chunks go missing. (And if you don't mind the bloat of storing a 2-bit value in a full byte, the concept could even be used for gene sequencing, though it couldn't understand mutation.)