diffcore-delta.c: update the comment on the algorithm.
The comment at the top of the file described an old algorithm
that was neutral to text/binary differences (it hashed sliding
windows of N-byte sequences and counted overlaps), but long ago
we switched to a new heuristic that is better suited to
line-oriented (read: text) files and is much faster.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
Junio C Hamano committed Jul 1, 2007
1 parent 706098a commit af3abef
Showing 1 changed file with 9 additions and 12 deletions.
diffcore-delta.c (9 additions, 12 deletions)
@@ -5,23 +5,20 @@
 /*
  * Idea here is very simple.
  *
- * We have total of (sz-N+1) N-byte overlapping sequences in buf whose
- * size is sz. If the same N-byte sequence appears in both source and
- * destination, we say the byte that starts that sequence is shared
- * between them (i.e. copied from source to destination).
+ * Almost all data we are interested in are text, but sometimes we have
+ * to deal with binary data. So we cut them into chunks delimited by
+ * LF byte, or 64-byte sequence, whichever comes first, and hash them.
  *
- * For each possible N-byte sequence, if the source buffer has more
- * instances of it than the destination buffer, that means the
- * difference are the number of bytes not copied from source to
- * destination. If the counts are the same, everything was copied
- * from source to destination. If the destination has more,
- * everything was copied, and destination added more.
+ * For those chunks, if the source buffer has more instances of it
+ * than the destination buffer, that means the difference are the
+ * number of bytes not copied from source to destination. If the
+ * counts are the same, everything was copied from source to
+ * destination. If the destination has more, everything was copied,
+ * and destination added more.
  *
  * We are doing an approximation so we do not really have to waste
  * memory by actually storing the sequence. We just hash them into
  * somewhere around 2^16 hashbuckets and count the occurrences.
- *
- * The length of the sequence is arbitrarily set to 8 for now.
  */
 
 /* Wild guess at the initial hash size */
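The updated comment describes the chunking heuristic only in prose. As a rough
illustration, here is a minimal C sketch of the counting step under the stated
rules: cut the buffer at an LF byte or at 64 bytes, whichever comes first, hash
each chunk into roughly 2^16 buckets, and accumulate byte counts. The names
(count_chunks, HASHBUCKETS, MAXCHUNK) and the djb2-style hash are
illustration-only assumptions; the real diffcore-delta.c uses its own hash
function and a resizable spanhash table rather than a fixed array.

#include <stddef.h>

#define HASHBUCKETS (1u << 16)	/* "somewhere around 2^16 hashbuckets" */
#define MAXCHUNK 64		/* chunk cap when no LF shows up */

/*
 * Cut buf into chunks ending at an LF byte or at 64 bytes, whichever
 * comes first; hash each chunk and add its size in bytes to that
 * bucket's count.  Counting bytes (not chunks) matches the comment's
 * talk of "the number of bytes not copied".
 */
static void count_chunks(const unsigned char *buf, size_t sz,
			 unsigned int counts[HASHBUCKETS])
{
	size_t i = 0;

	while (i < sz) {
		unsigned long hash = 5381;	/* djb2, illustration only */
		size_t n = 0;

		while (i < sz && n < MAXCHUNK) {
			unsigned char c = buf[i++];
			hash = hash * 33 + c;
			n++;
			if (c == '\n')
				break;
		}
		counts[hash % HASHBUCKETS] += n;
	}
}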

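Continuing the same sketch (reusing HASHBUCKETS and the counts filled in by the
hypothetical count_chunks above), the comparison rule from the comment, more in
source means uncopied bytes, equal means all copied, more in destination means
copied plus added, reduces to summing the per-bucket minimum:

/*
 * Per the comment: if src has more instances, the excess was not
 * copied; if counts match, everything was copied; if dst has more,
 * everything was copied and dst added the rest.  In all three cases
 * the shared amount per bucket is min(src, dst).
 */
static size_t estimate_copied(const unsigned int src[HASHBUCKETS],
			      const unsigned int dst[HASHBUCKETS])
{
	size_t copied = 0;
	unsigned int i;

	for (i = 0; i < HASHBUCKETS; i++)
		copied += src[i] < dst[i] ? src[i] : dst[i];
	return copied;
}

In the file itself, this kind of estimate is what diffcore_count_changes()
produces for the rename/copy detection machinery to score file similarity.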