Parse surname and given name
Identify the presumed surname in a character string assumed to represent a name and return the result in a character matrix with surname followed by givenName. If only one name is provided (without punctuation), it is assumed to be the givenName; see Wikipedia, "Given name" and "Surname".
parseName(x, surnameFirst=(median(regexpr(',', x))>0), suffix=c('Jr.', 'I', 'II', 'III', 'IV', 'Sr.', 'Dr.', 'Jr', 'Sr'), fixNonStandard=subNonStandardNames, removeSecondLine=TRUE, namesNotFound="attr.replacement", ...)
x | a character vector |
surnameFirst | logical: If TRUE, the surname comes first followed by a comma (","), then the given name. If FALSE, parse the surname from a standard Western "John Smith, Jr." format. If |
suffix | character vector of strings that are NOT a surname but might appear at the end without a comma that would otherwise identify it as a suffix. |
fixNonStandard | function to look for and repair nonstandard names such as names containing characters with accent marks that are sometimes mangled by different software. Use |
removeSecondLine | logical: If TRUE, delete anything following "\n" and return it as an attribute |
namesNotFound | character vector passed to |
... | optional arguments passed to |
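The default for surnameFirst in the usage above, median(regexpr(',', x)) > 0, treats the input as "surname, given name" only when more than half of the names contain a comma. The following is a minimal sketch of that rule, not the package's code: the helper name commaPos and the as.integer() wrapper are assumptions added here for illustration.

```r
# Hypothetical helper (not part of the package): position of the first
# comma in each name, or -1 when the name contains no comma.
# as.integer() drops the attributes that regexpr() attaches.
commaPos <- function(x) as.integer(regexpr(",", x, fixed = TRUE))

# Only one of three names has a comma, so the median position is -1
# and the default would evaluate to FALSE.
mostlyFirstLast <- c("Sting", "Madonna", "Smith, Al")
commaPos(mostlyFirstLast)                 # -1 -1  6
median(commaPos(mostlyFirstLast)) > 0     # FALSE

# Every name has a comma, so the median position is positive
# and the default would evaluate to TRUE.
mostlySurnameFirst <- c("Brown, John, Jr.", "Smith-Johnson, Linda Rosa",
                        "Colette,")
median(commaPos(mostlySurnameFirst)) > 0  # TRUE
```

When names with a comma are in the minority, as in the first vector, surnameFirst = TRUE has to be passed explicitly to get surname-first parsing.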
If surnameFirst is FALSE:
1. If the last character is ")" and the matching "(" is 3 characters earlier, drop the parenthetical. Thus, "John Smith (AL)" becomes "John Smith".
2. Look for commas to identify a suffix like Jr. or III; remove and call the rest x2.
3. split <- strsplit(x2, " ")
4. Take the last as the surname.
5. If the "surname" found in step 4 is in suffix, save it to append to the givenName and recurse to get the actual surname.
NOTE: This gives the wrong answer for double surnames written without a hyphen in the Spanish tradition. For example, in "Anastasio Somoza Debayle", "Somoza" and "Debayle" are the (first) surnames of Anastasio's father and mother, respectively, so the surname is "Somoza Debayle". The current algorithm returns "Debayle" as the surname, which is incorrect.
6. Recompose the rest with any suffix as the givenName.
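The steps above can be sketched for a single name string as follows. This is a simplified illustration under stated assumptions, not the package's implementation: the function name sketchParseName is hypothetical, a loop stands in for the recursion of step 5, and special cases such as a lone name followed by a comma ("Colette, "), nonstandard characters, and second lines are not handled.

```r
# Hypothetical sketch of the surnameFirst = FALSE steps for ONE non-empty
# name string; NOT the package's parseName().
sketchParseName <- function(x, suffix = c("Jr.", "I", "II", "III", "IV",
                                          "Sr.", "Dr.", "Jr", "Sr")) {
  # Step 1: drop a trailing parenthetical whose "(" is 3 characters
  # before the final ")", e.g. " (AL)".
  x <- sub(" *\\([^()]{2}\\)$", "", x)
  # Step 2: anything after a comma is treated as a suffix; the rest is x2.
  parts <- trimws(strsplit(x, ",", fixed = TRUE)[[1]])
  x2 <- parts[1]
  sfx <- parts[-1]
  sfx <- sfx[nzchar(sfx)]
  # Step 3: split x2 on blanks.
  words <- strsplit(x2, " +")[[1]]
  # Step 5 (a loop instead of recursion): a trailing word listed in
  # 'suffix' is not a surname; set it aside and look again.
  while (length(words) > 1 && words[length(words)] %in% suffix) {
    sfx <- c(words[length(words)], sfx)
    words <- words[-length(words)]
  }
  # Step 4: the last word is the surname; a single word is a given name.
  n <- length(words)
  if (n == 1) {
    surname <- ""
    given <- words
  } else {
    surname <- words[n]
    given <- paste(words[-n], collapse = " ")
  }
  # Step 6: recompose the given name with any suffix.
  if (length(sfx) > 0) given <- paste(c(given, sfx), collapse = ", ")
  c(surname = surname, givenName = given)
}

sketchParseName("Joe Smith (AL)")            # "Smith", "Joe"
sketchParseName("John Brown Jr.")            # "Brown", "John, Jr."
sketchParseName("Anastasio Somoza Debayle")  # "Debayle", "Anastasio Somoza"
```

The last call reproduces the limitation described in the NOTE: the sketch, like the documented algorithm, returns "Debayle" rather than "Somoza Debayle".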
a character matrix with two columns: surname and givenName. This matrix also has a namesNotFound attribute if one is returned by subNonStandardNames.
Spencer Graves
##
## 1.  Parse standard first-last name format
##
tstParse <- c('Joe Smith (AL)', 'Teresa Angelica Sanchez de Gomez',
              'John Brown, Jr.', 'John Brown Jr.',
              'John W. Brown III', 'John Q. Brown,I',
              'Linda Rosa Smith-Johnson',
              'Anastasio Somoza Debayle',
              'Ra_l Vel_zquez', 'Sting', 'Colette, ')
parsed <- parseName(tstParse)

tstParse2 <- matrix(c('Smith', 'Joe',
                      'Gomez', 'Teresa Angelica Sanchez de',
                      'Brown', 'John, Jr.',
                      'Brown', 'John, Jr.',
                      'Brown', 'John W., III',
                      'Brown', 'John Q., I',
                      'Smith-Johnson', 'Linda Rosa',
                      'Debayle', 'Anastasio Somoza',
                      'Velazquez', 'Raul',
                      '', 'Sting',
                      'Colette', ''),
                    ncol=2, byrow=TRUE)
# NOTE:  The 'Anastasio Somoza Debayle' is in the Spanish tradition
# and is handled incorrectly by the current algorithm.
# The correct answer should be "Somoza Debayle", "Anastasio".
# However, fixing that would complicate the algorithm excessively for now.
colnames(tstParse2) <- c("surname", 'givenName')

all.equal(parsed, tstParse2)

##
## 2.  Parse "surname, given name" format
##
tst3 <- c('Smith (AL),Joe', 'Sanchez de Gomez, Teresa Angelica',
          'Brown, John, Jr.', 'Brown, John W., III',
          'Brown, John Q., I', 'Smith-Johnson, Linda Rosa',
          'Somoza Debayle, Anastasio', 'Vel_zquez, Ra_l',
          ', Sting', 'Colette,')
tst4 <- parseName(tst3)

tst5 <- matrix(c('Smith', 'Joe',
                 'Sanchez de Gomez', 'Teresa Angelica',
                 'Brown', 'John, Jr.',
                 'Brown', 'John W., III',
                 'Brown', 'John Q., I',
                 'Smith-Johnson', 'Linda Rosa',
                 'Somoza Debayle', 'Anastasio',
                 'Velazquez', 'Raul',
                 '', 'Sting',
                 'Colette', ''),
               ncol=2, byrow=TRUE)
colnames(tst5) <- c("surname", 'givenName')

all.equal(tst4, tst5)

##
## 3.  secondLine
##
L2 <- parseName(c('Adam\n2nd line', 'Ed  \n --Vacancy', 'Frank'))

# check
L2. <- matrix(c('', 'Adam', '', 'Ed', '', 'Frank'),
              ncol=2, byrow=TRUE)
colnames(L2.) <- c('surname', 'givenName')
attr(L2., 'secondLine') <- c('2nd line', ' --Vacancy', NA)

all.equal(L2, L2.)

##
## 4.  Force surnameFirst when in a minority
##
snf <- c('Sting', 'Madonna', 'Smith, Al')
SNF <- parseName(snf, surnameFirst=TRUE)

# check
SNF2 <- matrix(c('', 'Sting', '', 'Madonna', 'Smith', 'Al'),
               ncol=2, byrow=TRUE)
colnames(SNF2) <- c('surname', 'givenName')

all.equal(SNF, SNF2)

##
## 5.  nameNotFound
##
noSub <- parseName('xx_x')

# check
noSub. <- matrix(c('', 'xx_x'), 1)
colnames(noSub.) <- c('surname', 'givenName')
attr(noSub., 'namesNotFound') <- 'xx_x'

all.equal(noSub, noSub.)