log: re-encode commit messages before grepping

If you run "git log --grep=foo", we will run your regex on
the literal bytes of the commit message. This can provide
confusing results if the commit message is not in the same
encoding as your grep expression (or worse, you have commits
in multiple encodings, in which case your regex would need
to be written to match either encoding). On top of this, we
might also be grepping in the commit's notes, which are
already re-encoded, potentially leading to grepping in a
buffer with mixed encodings concatenated. This is insanity,
but most people never noticed, because their terminal and
their commit encodings all match.
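
To make the mismatch concrete, here is a minimal sketch that uses plain shell tools rather than git itself. The helper name contains_bytes is made up for illustration; it relies only on printf and grep -F in the C locale, so the comparison is byte for byte:

```shell
#!/bin/sh
# Two byte-level spellings of "é" (U+00E9).
utf8_e=$(printf '\303\251')    # UTF-8: 0xC3 0xA9
latin1_e=$(printf '\351')      # ISO-8859-1: 0xE9

# contains_bytes (hypothetical helper): succeed if the bytes of $2 occur in $1.
contains_bytes() {
	printf '%s\n' "$1" | LC_ALL=C grep -qF -e "$2"
}

# A message body as it would be stored by a Latin-1 commit.
latin1_msg="t${latin1_e}st"

# A pattern typed in a UTF-8 terminal never matches the raw Latin-1 bytes.
if contains_bytes "$latin1_msg" "$utf8_e"; then
	raw_result=match
else
	raw_result=no-match
fi
echo "UTF-8 pattern vs. raw Latin-1 message: $raw_result"
```

The same character is one byte in one encoding and two different bytes in the other, so a regex run on the raw bytes cannot match both unless it spells out each encoding.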

Instead, let's massage the to-be-grepped commit into a
standardized encoding. There is not much point in adding a
flag for "this is the encoding I expect my grep pattern to
match"; the only sane choice is for it to use the log output
encoding. That is presumably what the user's terminal is
using, and it means that the patterns found by the grep will
match the output produced by git.
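
As a rough illustration of the normalization idea (not of git's actual code path), the sketch below re-encodes every message to a single encoding with the POSIX iconv utility before matching. The helper name reencode_to_utf8 and the choice of UTF-8 as the stand-in for the log output encoding are assumptions:

```shell
#!/bin/sh
utf8_e=$(printf '\303\251')    # "é" in UTF-8
latin1_e=$(printf '\351')      # "é" in ISO-8859-1

# reencode_to_utf8 (hypothetical helper): convert stdin from encoding $1 to UTF-8.
reencode_to_utf8() {
	iconv -f "$1" -t UTF-8
}

# Two stored messages, each in its own commit encoding.
latin1_msg="t${latin1_e}st"
utf8_msg="t${utf8_e}st"

# Normalize both to the (assumed) output encoding before searching.
normalized_latin1=$(printf '%s\n' "$latin1_msg" | reencode_to_utf8 ISO-8859-1)
normalized_utf8=$(printf '%s\n' "$utf8_msg" | reencode_to_utf8 UTF-8)

# Count how many normalized messages contain the UTF-8 pattern.
matches=0
for msg in "$normalized_latin1" "$normalized_utf8"; do
	if printf '%s\n' "$msg" | LC_ALL=C grep -qF -e "$utf8_e"; then
		matches=$((matches + 1))
	fi
done
echo "messages matching the UTF-8 pattern after re-encoding: $matches"
```

Once both messages are in the same encoding, a single pattern written in that encoding finds both of them.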

As a bonus, this fixes a potential segfault in commit_match
when commit->buffer is NULL, as we now build on logmsg_reencode,
which handles reading the commit buffer from disk if
necessary. The segfault can be triggered with:

        git commit -m 'text1' --allow-empty
        git commit -m 'text2' --allow-empty
        git log --graph --no-walk --grep 'text2'

which arguably does not make any sense (--graph inherently
wants a connected history, and by --no-walk the command line
is telling us to show discrete points in history without
connectivity), and we probably should forbid the
combination, but that is a separate issue.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jeff King authored and Junio C Hamano committed Feb 11, 2013
1 parent be5c9fb commit 04deccd
Showing 2 changed files with 78 additions and 7 deletions.
27 changes: 20 additions & 7 deletions revision.c
@@ -2268,7 +2268,10 @@ static int commit_rewrite_person(struct strbuf *buf, const char *what, struct st
 static int commit_match(struct commit *commit, struct rev_info *opt)
 {
 	int retval;
+	const char *encoding;
+	char *message;
 	struct strbuf buf = STRBUF_INIT;
+
 	if (!opt->grep_filter.pattern_list && !opt->grep_filter.header_list)
 		return 1;

@@ -2279,13 +2282,23 @@ static int commit_match(struct commit *commit, struct rev_info *opt)
 		strbuf_addch(&buf, '\n');
 	}

+	/*
+	 * We grep in the user's output encoding, under the assumption that it
+	 * is the encoding they are most likely to write their grep pattern
+	 * for. In addition, it means we will match the "notes" encoding below,
+	 * so we will not end up with a buffer that has two different encodings
+	 * in it.
+	 */
+	encoding = get_log_output_encoding();
+	message = logmsg_reencode(commit, encoding);
+
 	/* Copy the commit to temporary if we are using "fake" headers */
 	if (buf.len)
-		strbuf_addstr(&buf, commit->buffer);
+		strbuf_addstr(&buf, message);

 	if (opt->grep_filter.header_list && opt->mailmap) {
 		if (!buf.len)
-			strbuf_addstr(&buf, commit->buffer);
+			strbuf_addstr(&buf, message);

 		commit_rewrite_person(&buf, "\nauthor ", opt->mailmap);
 		commit_rewrite_person(&buf, "\ncommitter ", opt->mailmap);
@@ -2294,18 +2307,18 @@ static int commit_match(struct commit *commit, struct rev_info *opt)
 	/* Append "fake" message parts as needed */
 	if (opt->show_notes) {
 		if (!buf.len)
-			strbuf_addstr(&buf, commit->buffer);
-		format_display_notes(commit->object.sha1, &buf,
-				     get_log_output_encoding(), 1);
+			strbuf_addstr(&buf, message);
+		format_display_notes(commit->object.sha1, &buf, encoding, 1);
 	}

-	/* Find either in the commit object, or in the temporary */
+	/* Find either in the original commit message, or in the temporary */
 	if (buf.len)
 		retval = grep_buffer(&opt->grep_filter, buf.buf, buf.len);
 	else
 		retval = grep_buffer(&opt->grep_filter,
-				     commit->buffer, strlen(commit->buffer));
+				     message, strlen(message));
 	strbuf_release(&buf);
+	logmsg_free(message, commit);
 	return retval;
 }
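
The concern in the new comment about a buffer holding two different encodings can be illustrated outside of git. In this sketch, which assumes the iconv utility is available, a Latin-1 message concatenated with UTF-8 notes text fails validation as UTF-8, while re-encoding the message first yields a uniform buffer. The names is_valid_utf8, notes, mixed and uniform are hypothetical and do not correspond to anything in revision.c:

```shell
#!/bin/sh
utf8_e=$(printf '\303\251')    # "é" in UTF-8
latin1_e=$(printf '\351')      # "é" in ISO-8859-1

# is_valid_utf8 (hypothetical helper): succeed if stdin decodes cleanly as UTF-8.
is_valid_utf8() {
	iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

message="t${latin1_e}st"       # commit message stored as ISO-8859-1
notes="Notes: caf${utf8_e}"    # notes text already in the output encoding

# Without re-encoding: two encodings end up concatenated in one buffer.
mixed=$(printf '%s\n%s\n' "$message" "$notes")

# With re-encoding: convert the message first, then append the notes.
reencoded=$(printf '%s\n' "$message" | iconv -f ISO-8859-1 -t UTF-8)
uniform=$(printf '%s\n%s\n' "$reencoded" "$notes")

if printf '%s\n' "$mixed" | is_valid_utf8; then mixed_ok=yes; else mixed_ok=no; fi
if printf '%s\n' "$uniform" | is_valid_utf8; then uniform_ok=yes; else uniform_ok=no; fi
echo "mixed buffer is valid UTF-8: $mixed_ok"
echo "uniform buffer is valid UTF-8: $uniform_ok"
```

Converting the message before anything is appended is what keeps the grepped buffer in one encoding, which is the ordering the patch uses when it calls logmsg_reencode ahead of format_display_notes.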

58 changes: 58 additions & 0 deletions t/t4210-log-i18n.sh
@@ -0,0 +1,58 @@
+#!/bin/sh
+
+test_description='test log with i18n features'
+. ./test-lib.sh
+
+# two forms of é
+utf8_e=$(printf '\303\251')
+latin1_e=$(printf '\351')
+
+test_expect_success 'create commits in different encodings' '
+	test_tick &&
+	cat >msg <<-EOF &&
+	utf8
+
+	t${utf8_e}st
+	EOF
+	git add msg &&
+	git -c i18n.commitencoding=utf8 commit -F msg &&
+	cat >msg <<-EOF &&
+	latin1
+
+	t${latin1_e}st
+	EOF
+	git add msg &&
+	git -c i18n.commitencoding=ISO-8859-1 commit -F msg
+'
+
+test_expect_success 'log --grep searches in log output encoding (utf8)' '
+	cat >expect <<-\EOF &&
+	latin1
+	utf8
+	EOF
+	git log --encoding=utf8 --format=%s --grep=$utf8_e >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'log --grep searches in log output encoding (latin1)' '
+	cat >expect <<-\EOF &&
+	latin1
+	utf8
+	EOF
+	git log --encoding=ISO-8859-1 --format=%s --grep=$latin1_e >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'log --grep does not find non-reencoded values (utf8)' '
+	>expect &&
+	git log --encoding=utf8 --format=%s --grep=$latin1_e >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'log --grep does not find non-reencoded values (latin1)' '
+	>expect &&
+	git log --encoding=ISO-8859-1 --format=%s --grep=$utf8_e >actual &&
+	test_cmp expect actual
+'
+
+test_done
