get_maintainer: correctly parse UTF-8 encoded names in files
While the script correctly extracts UTF-8 encoded names from the
MAINTAINERS file, the regular expressions damage my name when parsing
from .yaml files.  Fix this by replacing the Latin-1-compatible regular
expressions with the unicode property matcher \p{L}, which matches on
any letter according to the Unicode General Category of letters.
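
The difference between the two character classes can be sketched as follows. This is a minimal illustration, not code from the script: it assumes an already-decoded string and a simplified pattern, and the variable names are hypothetical.

```perl
use strict;
use warnings;
use utf8;                              # string literals below are UTF-8 source text
binmode(STDOUT, ':encoding(UTF-8)');

my $name = "Alvin Šipraga";

# Old-style class: ASCII letters plus the Latin-1 range À-ÿ (U+00C0..U+00FF).
# "Š" is U+0160, which lies outside that range, so the match stops before it.
my ($latin1) = $name =~ /^([A-Za-zÀ-ÿ ]+)/;

# New-style class: any code point in the Unicode General Category "Letter".
my ($unicode) = $name =~ /^([\p{L} ]+)/;

print "Latin-1 class: '$latin1'\n";    # 'Alvin '
print "\\p{L} class:   '$unicode'\n";  # 'Alvin Šipraga'
```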

The proposed solution only works if the script uses proper string
encoding from the outset, so instruct Perl to unconditionally open all
files with UTF-8 encoding.  This should be safe, as the entire source
tree is either UTF-8 or ASCII encoded anyway.  See [1] for a detailed
analysis.
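
Why the encoding layer matters can be shown with a small sketch. It assumes a scratch file created with File::Temp and two hypothetical helpers, read_raw and read_decoded; the patch itself also passes :std, which is left out here so that the standard handles stay untouched.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the UTF-8 bytes for "Šipraga" to a scratch file ("Š" is 0xC5 0xA0).
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp, ':raw');
print {$tmp} "\xC5\xA0ipraga\n";
close($tmp);

# Hypothetical helper: read the line as raw bytes, one character per byte.
sub read_raw {
    open(my $fh, '<:raw', $path) or die "Can't open $path: $!\n";
    chomp(my $line = <$fh>);
    close($fh);
    return $line;
}

# Hypothetical helper: the open pragma is lexically scoped, so only the
# open() inside this sub gets the UTF-8 decoding layer by default.
sub read_decoded {
    use open qw(:encoding(UTF-8));
    open(my $fh, '<', $path) or die "Can't open $path: $!\n";
    chomp(my $line = <$fh>);
    close($fh);
    return $line;
}

printf "raw:     %d characters\n", length(read_raw());      # 8
printf "decoded: %d characters\n", length(read_decoded());  # 7
printf "decoded is all letters: %s\n",
    (read_decoded() =~ /^\p{L}+$/ ? "yes" : "no");          # yes
```

Without the decoding layer, the two bytes of "Š" are seen as two separate characters, and the second one (0xA0) is not a letter, so \p{L} alone would not help.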

Furthermore, to prevent the \w expression from matching non-ASCII when
checking for whether a name should be escaped with quotes, add the /a
flag to the regular expression.  The escaping logic was duplicated in
two places, so it has been factored out into its own function.
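
The effect of the /a flag on the quoting check can be sketched like this. The two helper names are hypothetical and exist only for comparison; the character class is the one used in the patch.

```perl
use strict;
use warnings;
use utf8;                              # string literals below are UTF-8 source text
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper: the check as it was, where \w follows Unicode rules
# on decoded strings and therefore accepts letters such as "Š".
sub needs_quotes_unicode {
    my ($name) = @_;
    return $name =~ /[^\w \-]/i ? 1 : 0;
}

# Hypothetical helper: the check with /a, where \w is limited to [A-Za-z0-9_].
sub needs_quotes_ascii {
    my ($name) = @_;
    return $name =~ /[^\w \-]/ai ? 1 : 0;
}

for my $name ("Linus Torvalds", "Alvin Šipraga", "Doe, Jane") {
    printf "%-16s without /a: %d  with /a: %d\n",
        $name, needs_quotes_unicode($name), needs_quotes_ascii($name);
}
```

With /a, any name containing a non-ASCII letter is treated as needing quotes, which matches the stated intent of the commit message.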

The original issue was also identified on the tools mailing list [2].
This should solve the observed side effects there as well.

Link: https://lore.kernel.org/all/dzn6uco4c45oaa3ia4u37uo5mlt33obecv7gghj2l756fr4hdh@mt3cprft3tmq/ [1]
Link: https://lore.kernel.org/tools/20230726-gush-slouching-a5cd41@meerkat/ [2]
Signed-off-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alvin Šipraga authored and Linus Torvalds committed Dec 31, 2023
1 parent 453f5db commit 9c334eb
Showing 1 changed file with 17 additions and 13 deletions.
30 changes: 17 additions & 13 deletions scripts/get_maintainer.pl
@@ -20,6 +20,7 @@
 use Cwd;
 use File::Find;
 use File::Spec::Functions;
+use open qw(:std :encoding(UTF-8));

 my $cur_path = fastgetcwd() . '/';
 my $lk_path = "./";
@@ -445,7 +446,7 @@ sub maintainers_in_file {
 	my $text = do { local($/) ; <$f> };
 	close($f);

-	my @poss_addr = $text =~ m$[A-Za-zÀ-ÿ\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
+	my @poss_addr = $text =~ m$[\p{L}\"\' \,\.\+-]*\s*[\,]*\s*[\(\<\{]{0,1}[A-Za-z0-9_\.\+-]+\@[A-Za-z0-9\.-]+\.[A-Za-z0-9]+[\)\>\}]{0,1}$g;
 	push(@file_emails, clean_file_emails(@poss_addr));
     }
 }
@@ -1152,6 +1153,17 @@ sub top_of_kernel_tree {
     return 0;
 }

+sub escape_name {
+    my ($name) = @_;
+
+    if ($name =~ /[^\w \-]/ai) { ##has "must quote" chars
+	$name =~ s/(?<!\\)"/\\"/g; ##escape quotes
+	$name = "\"$name\"";
+    }
+
+    return $name;
+}
+
 sub parse_email {
     my ($formatted_email) = @_;

@@ -1169,13 +1181,9 @@ sub parse_email {

     $name =~ s/^\s+|\s+$//g;
     $name =~ s/^\"|\"$//g;
+    $name = escape_name($name);
     $address =~ s/^\s+|\s+$//g;

-    if ($name =~ /[^\w \-]/i) { ##has "must quote" chars
-	$name =~ s/(?<!\\)"/\\"/g; ##escape quotes
-	$name = "\"$name\"";
-    }
-
     return ($name, $address);
 }

@@ -1186,13 +1194,9 @@ sub format_email {

     $name =~ s/^\s+|\s+$//g;
     $name =~ s/^\"|\"$//g;
+    $name = escape_name($name);
     $address =~ s/^\s+|\s+$//g;

-    if ($name =~ /[^\w \-]/i) { ##has "must quote" chars
-	$name =~ s/(?<!\\)"/\\"/g; ##escape quotes
-	$name = "\"$name\"";
-    }
-
     if ($usename) {
 	if ("$name" eq "") {
 	    $formatted_email = "$address";
@@ -2462,13 +2466,13 @@ sub clean_file_emails {
 	    $name = "";
 	}

-	my @nw = split(/[^A-Za-zÀ-ÿ\'\,\.\+-]/, $name);
+	my @nw = split(/[^\p{L}\'\,\.\+-]/, $name);
 	if (@nw > 2) {
 	    my $first = $nw[@nw - 3];
 	    my $middle = $nw[@nw - 2];
 	    my $last = $nw[@nw - 1];

-	    if (((length($first) == 1 && $first =~ m/[A-Za-z]/) ||
+	    if (((length($first) == 1 && $first =~ m/\p{L}/) ||
 		 (length($first) == 2 && substr($first, -1) eq ".")) ||
 		(length($middle) == 1 ||
 		 (length($middle) == 2 && substr($middle, -1) eq "."))) {