Tussenvoegsels / family name prefixes support #132

Draft
wants to merge 4 commits into
base: master
36 changes: 21 additions & 15 deletions nameparser/config/__init__.py
@@ -1,7 +1,7 @@
# -*- coding: utf-8 -*-
"""
The :py:mod:`nameparser.config` module manages the configuration of the
nameparser.

A module-level instance of :py:class:`~nameparser.config.Constants` is created
and used by default for all HumanName instances. You can adjust the entire module's
@@ -25,11 +25,12 @@
>>> hn.parse_full_name() # need to run this again after config changes

**Potential Gotcha**: If you do not pass ``None`` as the second argument,
``hn.C`` will be a reference to the module config, possibly yielding
unexpected results. See `Customizing the Parser <customize.html>`_.
"""
from __future__ import unicode_literals
import sys

try:
# Python 3.3+
from collections.abc import Set
@@ -46,6 +47,7 @@
from nameparser.config.titles import TITLES
from nameparser.config.titles import FIRST_NAME_TITLES
from nameparser.config.regexes import REGEXES
from nameparser.config.affixes import AFFIXES

DEFAULT_ENCODING = 'UTF-8'

@@ -57,7 +59,7 @@ class SetManager(Set):

Only special functionality beyond that provided by set() is
to normalize constants for comparison (lower case, no periods)
when they are add()ed and remove()d and allow passing multiple
string arguments to the :py:func:`add()` and :py:func:`remove()` methods.

'''
@@ -125,7 +127,7 @@ def remove(self, *strings):

class TupleManager(dict):
'''
A dictionary with dot.notation access. Subclass of ``dict``. Makes the tuple constants
more friendly.
'''

@@ -148,23 +150,25 @@ class Constants(object):
"""
An instance of this class holds all of the configuration constants for the parser.

:param set prefixes:
    :py:attr:`prefixes` wrapped with :py:class:`SetManager`.
:param set family_affixes:
    :py:attr:`~affixes.AFFIXES` wrapped with :py:class:`SetManager`.
:param set titles:
    :py:attr:`titles` wrapped with :py:class:`SetManager`.
:param set first_name_titles:
    :py:attr:`~titles.FIRST_NAME_TITLES` wrapped with :py:class:`SetManager`.
:param set suffix_acronyms:
    :py:attr:`~suffixes.SUFFIX_ACRONYMS` wrapped with :py:class:`SetManager`.
:param set suffix_not_acronyms:
    :py:attr:`~suffixes.SUFFIX_NOT_ACRONYMS` wrapped with :py:class:`SetManager`.
:param set conjunctions:
    :py:attr:`conjunctions` wrapped with :py:class:`SetManager`.
:type capitalization_exceptions: tuple or dict
:param capitalization_exceptions:
    :py:attr:`~capitalization.CAPITALIZATION_EXCEPTIONS` wrapped with :py:class:`TupleManager`.
:type regexes: tuple or dict
:param regexes:
    :py:attr:`regexes` wrapped with :py:class:`TupleManager`.
"""

@@ -187,17 +191,17 @@ class Constants(object):
empty_attribute_default = ''
"""
Default return value for empty attributes.

.. doctest::

>>> from nameparser.config import CONSTANTS
>>> CONSTANTS.empty_attribute_default = None
>>> name = HumanName("John Doe")
>>> name.title
None
>>> name.first
'John'

"""

capitalize_name = False
@@ -233,6 +237,7 @@ class Constants(object):

def __init__(self,
prefixes=PREFIXES,
family_affixes=AFFIXES,
suffix_acronyms=SUFFIX_ACRONYMS,
suffix_not_acronyms=SUFFIX_NOT_ACRONYMS,
titles=TITLES,
@@ -242,6 +247,7 @@ def __init__(self,
regexes=REGEXES
):
self.prefixes = SetManager(prefixes)
self.family_affixes = SetManager(family_affixes)
self.suffix_acronyms = SetManager(suffix_acronyms)
self.suffix_not_acronyms = SetManager(suffix_not_acronyms)
self.titles = SetManager(titles)
123 changes: 123 additions & 0 deletions nameparser/config/affixes.py
@@ -0,0 +1,123 @@
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

# https://en.wikipedia.org/wiki/List_of_family_name_affixes

AFFIXES = set([
'a',
'ab',
'af',
'av',
'ap',
'abu',
'ait',
'aït',
'alam',
'at',
'ath',
'aust',
'austre',
'bar',
'bat',
'bath',
'ben',
'bin',
'ibn',
'bert',
'bet',
'bint',
'da',
'das',
'de',
'degli',
Owner

Many of these affixes are already defined in prefixes.py.

It seems like what you mainly want is to get the parts that come before a last name and currently get joined to it (whether we call them prefixes, affixes, or tussenvoegsels) separately from the last name.

'del',
'dele',
'della',
'den',
'der',
'di',
'dos',
'du',
'e',
'el',
'fetch',
'vetch',
'fitz',
'i',
'kil',
'gil',
'la',
'le',
'lille',
'lu',
'm\'',
'mc',
'mac',
'mck',
'mhic',
'mic',
'mala',
'mellom',
'myljom',
'na',
'ned',
'nedre',
'neder',
'nic',
'ni',
'ní',
'nin',
'nord',
'norr',
'nordre',
'ny',
'o',
'ua',
'ui',
'uí',
'opp',
'upp',
'ofver',
'ost',
'oster',
'over',
'ovste',
'ovre',
'oz',
'pour',
'putra',
'putera',
'putri',
'setia',
'setya',
'stor',
'soder',
'sor',
'sonder',
'syd',
'sondre',
'syndre',
'sore',
'ter',
'\'t',
'tre',
'van',
'het',
'de',
'vast',
'väst',
'vaster',
'väster',
'verch',
'erch',
'vest',
'vestre',
'vesle',
'vetle',
'von',
'war',
'zu',
])
44 changes: 42 additions & 2 deletions nameparser/parser.py
@@ -47,6 +47,8 @@ class HumanName(object):
* :py:attr:`suffix`
* :py:attr:`nickname`
* :py:attr:`surnames`
* :py:attr:`family`
* :py:attr:`family_prefix`

:param str full_name: The name string to be parsed.
:param constants constants:
@@ -300,6 +302,16 @@ def last(self):
"""
return " ".join(self.last_list) or self.C.empty_attribute_default

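@property
def family(self):
    """
    The person's family name, including any family name affixes.
    """
    # Join each [affix, family] pair, then join the pairs with spaces,
    # so an empty result falls back to the configured default once.
    parts = [" ".join(affix + family) for affix, family in self.family_list]
    return " ".join(parts) or self.C.empty_attribute_default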

@property
def suffix(self):
"""
@@ -399,6 +411,19 @@ def is_prefix(self, piece):
else:
return lc(piece) in self.C.prefixes

def is_family_affix(self, piece):
    """
    Lowercase and no periods version of piece is in the
    :py:data:`~nameparser.config.affixes.AFFIXES` set.
    """
    if isinstance(piece, list):
        # True if any item in the list is a family affix, otherwise False.
        return any(self.is_family_affix(item) for item in piece)
    return lc(piece) in self.C.family_affixes


def is_roman_numeral(self, value):
"""
Matches the ``roman_numeral`` regular expression in
@@ -513,9 +538,9 @@ def parse_nicknames(self):
Loops through 3 :py:data:`~nameparser.config.regexes.REGEXES`;
`quoted_word`, `double_quotes` and `parenthesis`.
"""

empty_re = re.compile("")

re_quoted_word = self.C.regexes.quoted_word or empty_re
re_double_quotes = self.C.regexes.double_quotes or empty_re
re_parenthesis = self.C.regexes.parenthesis or empty_re
@@ -563,6 +588,7 @@ def parse_full_name(self):
self.first_list = []
self.middle_list = []
self.last_list = []
self.family_list = []
self.suffix_list = []
self.nickname_list = []
self.unparsable = True
@@ -699,6 +725,19 @@ def parse_full_name(self):
except IndexError:
pass

# Split each last name into an [affixes, family name parts] pair.
for last in self.last_list:
    if " " in last:
        affix = []
        family = []
        for part in last.split(" "):
            if self.is_family_affix(part):
                affix.append(part)
            else:
                family.append(part)
        self.family_list.append([affix, family])
    else:
        # A single-word last name is stored without any affix.
        self.family_list.append([[], [last]])

if len(self) < 0:
log.info("Unparsable: \"%s\" ", self.original)
else:
@@ -968,6 +1007,7 @@ def capitalize(self, force=None):
self.first_list = self.cap_piece(self.first, 'first').split(' ')
self.middle_list = self.cap_piece(self.middle, 'middle').split(' ')
self.last_list = self.cap_piece(self.last, 'last').split(' ')
# self.family_list = self.cap_piece(self.family, 'family').split(' ')
self.suffix_list = self.cap_piece(self.suffix, 'suffix').split(', ')

def handle_capitalization(self):
14 changes: 14 additions & 0 deletions tests.py
@@ -187,6 +187,20 @@ def test_prefix_names(self):
self.m(hn.first, "vai", hn)
self.m(hn.last, "la", hn)

def test_family_name_and_prefix(self):
    hn = HumanName("Vincent van Gogh")
    self.m(hn.family, "van Gogh", hn)
    self.assertEqual(hn.family_list, [
        [["van"], ["Gogh"]]
    ])

def test_family_name_and_double_prefix(self):
    hn = HumanName("Vincent van der Gogh")
    self.m(hn.family, "van der Gogh", hn)
    self.assertEqual(hn.family_list, [
        [["van", "der"], ["Gogh"]],
Owner

I think it makes sense to have these pieces in a separate attribute, but I'd lean more towards having separate attributes for those 2 nested lists. Maybe "affixes" and "family name", which combine together to give "last name", (which, btw, combines with middle names to give "surnames")?

We are kind of running out of names for things. And Wikipedia thinks that Surname, Last name, and family name are all the same thing, so there's that. 😄
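
A minimal sketch of that split, assuming a hypothetical helper `split_last_name` and a small sample affix set (neither is part of nameparser). It keeps the leading affixes and the family name in two separate lists that join back into the last name:

```python
# Hypothetical illustration; not part of nameparser.
AFFIXES = {"van", "der", "de", "von"}  # small sample set, for illustration only


def split_last_name(last):
    """Split a last name into (affixes, family_name) lists.

    Only leading words found in AFFIXES count as affixes; everything
    from the first non-affix word onward is the family name.
    """
    words = last.split()
    i = 0
    # Stop before the final word so the family name is never left empty.
    while i < len(words) - 1 and words[i].lower() in AFFIXES:
        i += 1
    return words[:i], words[i:]


affixes, family_name = split_last_name("van der Gogh")
# Joining the two lists gives back the original last name.
last_name = " ".join(affixes + family_name)
```

Treating only leading words as affixes is a design choice of this sketch; the loop in the diff classifies every word of the last name regardless of position.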

Owner

Also, Google translate says the English translation of tussenvoegsels is "infixes". Really? There's yet another one? I did not know that was a word. (It's a good day when I can learn a new English word, so thanks!) But I guess if we're desperate, we could use that.

Owner

@derek73 derek73 Feb 4, 2022

Kinda makes me wonder if we could do something more useful with slices, maybe something like each bucket gets an index value and you can slice off the parts you want/don't want?

[0:titles, 1:first, 2:middle, 3:prefixes, 4:last, 5:suffixes]

Owner

Actually, that is already how slice works. Been so long since I implemented it that I forgot. It returns the concatenated string of whichever members you slice.

So, if we added prefixes in there you could get just the last name with a slice, ex: hn[4:5].

_members = ['title', 'first', 'middle', 'prefix', 'last', 'suffix', 'nickname']

Mostly what we need is just to have the prefixes in their own bucket, then we can be more flexible when we display things. Probably means changing the parse tree to add prefix in there.
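
The bucket idea can be sketched roughly as follows. `NameBuckets` is a hypothetical stand-in class, not nameparser's `HumanName`, and joining the selected buckets into one string is an assumption of this sketch. Because the end index of a Python slice is exclusive, selecting only the bucket at index 4 takes `[4:5]`:

```python
# Hypothetical illustration; not nameparser's implementation.
MEMBERS = ["title", "first", "middle", "prefix", "last", "suffix", "nickname"]


class NameBuckets(object):
    """Holds one string per bucket and supports slicing in MEMBERS order."""

    def __init__(self, **parts):
        for member in MEMBERS:
            setattr(self, member, parts.get(member, ""))

    def __getitem__(self, key):
        if isinstance(key, slice):
            # Join the non-empty buckets selected by the slice.
            selected = [getattr(self, m) for m in MEMBERS[key]]
            return " ".join(s for s in selected if s)
        # Otherwise treat the key as a bucket name.
        return getattr(self, key)


name = NameBuckets(first="Vincent", prefix="van", last="Gogh")
last_only = name[4:5]    # just the "last" bucket
with_prefix = name[3:5]  # "prefix" and "last" together
```

With the prefix in its own bucket, callers decide at display time whether to include it, which is the flexibility described above.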

Author

The more I dive into the surname/last name/family name, the more confused I get... There doesn't seem to be a clear definition of any of these things.

Some entries I need are missing from the prefixes, which is why I made a separate list: I did not want to interfere with your work. I was looking into possible locales, but that seems to be a road I really don't want to go down...

> Mostly what we need is just to have the prefixes in their own bucket, then we can be more flexible when we display things. Probably means changing the parse tree to add prefix in there.

I tried "Vincent van Gogh van Beethoven", which gave me "van Gogh van" as a middle name. I dropped that attempt for now, but it is the reason I created a nested list of pairs (prefix, family name). Otherwise it would simply give me "van" twice and (possibly) format it as "van van Gogh Beethoven".
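
A small sketch of the difference between the two layouts, using hand-written sample data rather than parser output (all variable names here are hypothetical):

```python
# Hypothetical illustration of the two data layouts; not part of nameparser.

# Nested pairs keep each affix next to the family name it belongs to.
pairs = [
    [["van"], ["Gogh"]],
    [["van"], ["Beethoven"]],
]
from_pairs = " ".join(
    " ".join(affix + family) for affix, family in pairs
)

# Two flat lists lose that association when they are joined back together.
flat_affixes = ["van", "van"]
flat_families = ["Gogh", "Beethoven"]
from_flat = " ".join(flat_affixes + flat_families)
```

Here `from_pairs` reads "van Gogh van Beethoven", while `from_flat` reads "van van Gogh Beethoven".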

Owner

@derek73 derek73 Feb 6, 2022

There used to be similar conversations re: "title", and "suffix" because people got tripped up on the semantic meaning of a title vs a suffix, ex "Dr" and "MD" are kinda the same but they appear in different places in the name. I think the best strategy for the parser is to focus on the position of words (name parts) in the string, and special words that join with words before/after/around them in certain conditions/parts of the name. In the future I should probably choose names that focus on that positional information and avoid semantic meaning.

We can add some of those things from your affixes constants to prefixes. The only danger is ones that could also be first names, like "fitz", "mala" and "ned". I think the current handling for prefixes skips the first name though, so it could be fine to include those.

> I tried Vincent van Gogh van Beethoven which gave me van Gogh van as a middle name

👎 That's annoying. Doesn't seem like that was my intention:

# join everything after the prefix until the next prefix or suffix
try:
    if i == 0 and total_length >= 1:
        # If it's the first piece and there are more than 1 rootnames, assume it's a first name
        continue
    next_prefix = next(iter(filter(self.is_prefix, pieces[i + 1:])))
    j = pieces.index(next_prefix)
    if j == i + 1:
        # if there are two prefixes in sequence, join to the following piece
        j += 1
    new_piece = ' '.join(pieces[i:j])
    pieces = pieces[:i] + [new_piece] + pieces[j:]
except StopIteration:
    try:
        # if there are no more prefixes, look for a suffix to stop at
        stop_at = next(iter(filter(self.is_suffix, pieces[i + 1:])))
        j = pieces.index(stop_at)
        new_piece = ' '.join(pieces[i:j])
        pieces = pieces[:i] + [new_piece] + pieces[j:]
    except StopIteration:
        # if there were no suffixes, nothing to stop at so join all
        # remaining pieces
        new_piece = ' '.join(pieces[i:])
        pieces = pieces[:i] + [new_piece]
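
The intent stated in those comments can be sketched as a standalone function. This is a hypothetical rewrite, not nameparser's code: `join_prefixes` and the sample `PREFIXES` and `SUFFIXES` sets are assumptions made for illustration. It walks the pieces by position instead of looking a prefix up by value; `list.index` returns the first matching position, which may be one reason a repeated prefix such as "van" behaves unexpectedly, though that has not been verified here.

```python
# Hypothetical sketch of the intended grouping; not nameparser's code.
PREFIXES = {"van", "der", "de", "la", "von"}  # sample values only
SUFFIXES = {"jr", "sr", "iii"}                # sample values only


def join_prefixes(pieces):
    """Join each run of prefixes to the pieces that follow it.

    The first piece is left alone (assumed to be a first name), and a
    joined group ends at the next prefix run or at a suffix.
    """
    result = []
    i = 0
    while i < len(pieces):
        if i == 0 or pieces[i].lower() not in PREFIXES:
            result.append(pieces[i])
            i += 1
            continue
        j = i
        # Consume consecutive prefixes, e.g. "van der".
        while j < len(pieces) and pieces[j].lower() in PREFIXES:
            j += 1
        # Consume following pieces until the next prefix or a suffix.
        while (j < len(pieces)
               and pieces[j].lower() not in PREFIXES
               and pieces[j].lower() not in SUFFIXES):
            j += 1
        result.append(" ".join(pieces[i:j]))
        i = j
    return result


grouped = join_prefixes("Vincent van Gogh van Beethoven".split())
```

In this sketch each "van" stays attached to the name that follows it, so `grouped` holds "Vincent", "van Gogh" and "van Beethoven" as three separate pieces.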

])

def test_blank_name(self):
hn = HumanName()
self.m(hn.first, "", hn)