Tussenvoegsels / family name prefixes support #132
base: master
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,123 @@ | ||
# -*- coding: utf-8 -*- | ||
from __future__ import unicode_literals | ||
|
||
# https://en.wikipedia.org/wiki/List_of_family_name_affixes | ||
|
||
AFFIXES = set([ | ||
'a', | ||
'ab', | ||
'af', | ||
'av', | ||
'ap', | ||
'abu', | ||
'ait', | ||
'aït', | ||
'alam', | ||
'at', | ||
'ath', | ||
'aust', | ||
'austre', | ||
'bar', | ||
'bat', | ||
'bath', | ||
'ben', | ||
'bin', | ||
'ibn', | ||
'bert', | ||
'bet', | ||
'bint', | ||
'da', | ||
'das', | ||
'de', | ||
'degli', | ||
'del', | ||
'dele', | ||
'della', | ||
'den', | ||
'der', | ||
'di', | ||
'dos', | ||
'du', | ||
'e', | ||
'el', | ||
'fetch', | ||
'vetch', | ||
'fitz', | ||
'i', | ||
'kil', | ||
'gil', | ||
'la', | ||
'le', | ||
'lille', | ||
'lu', | ||
'm\'', | ||
'mc', | ||
'mac', | ||
'mck', | ||
'mhic', | ||
'mic', | ||
'mala', | ||
'mellom', | ||
'myljom', | ||
'na', | ||
'ned', | ||
'nedre', | ||
'neder', | ||
'nic', | ||
'ni', | ||
'ní', | ||
'nin', | ||
'nord', | ||
'norr', | ||
'nord', | ||
'nordre', | ||
'ny', | ||
'o', | ||
'ua', | ||
'ua', | ||
'ui', | ||
'uí', | ||
'opp', | ||
'upp', | ||
'ofver', | ||
'ost', | ||
'oster', | ||
'over', | ||
'ovste', | ||
'ovre', | ||
'oz', | ||
'pour', | ||
'putra', | ||
'putera', | ||
'putri', | ||
'putera', | ||
'setia', | ||
'setya', | ||
'stor', | ||
'soder', | ||
'sor', | ||
'sonder', | ||
'syd', | ||
'sondre', | ||
'syndre', | ||
'sore', | ||
'ter', | ||
'\'t', | ||
'tre', | ||
'van', | ||
'het', | ||
'de', | ||
'vast', | ||
'väst', | ||
'vaster', | ||
'väster', | ||
'verch', | ||
'erch', | ||
'vest', | ||
'vestre', | ||
'vesle', | ||
'vetle', | ||
'von', | ||
'war', | ||
'zu', | ||
]) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -187,6 +187,20 @@ def test_prefix_names(self): | ||
self.m(hn.first, "vai", hn) | ||
self.m(hn.last, "la", hn) | ||
|
||
def test_family_name_and_prefix(self): | ||
hn = HumanName("Vincent van Gogh") | ||
self.m(hn.family, "van Gogh", hn) | ||
self.assertEqual(hn.family_list, [ | ||
[["van"], ["Gogh"]] | ||
]) | ||
|
||
def test_family_name_and_double_prefix(self): | ||
hn = HumanName("Vincent van der Gogh") | ||
self.m(hn.family, "van der Gogh", hn) | ||
self.assertEqual(hn.family_list, [ | ||
[["van", "der"], ["Gogh"]], | ||
I think it makes sense to have these pieces in a separate attribute, but I'd lean more towards having separate attributes for those 2 nested lists. Maybe "affixes" and "family name", which combine together to give "last name" (which, btw, combines with middle names to give "surnames")? We are kind of running out of names for things. And Wikipedia thinks that surname, last name, and family name are all the same thing, so there's that. 😄

Also, Google Translate says the English translation of tussenvoegsels is "infixes". Really? There's yet another one? I did not know that was a word. (It's a good day when I can learn a new English word, so thanks!) But I guess if we're desperate, we could use that.

Kinda makes me wonder if we could do something more useful with slices, maybe something like each bucket gets an index value and you can slice off the parts you want/don't want?

Actually, that is already how slice works. It's been so long since I implemented it that I forgot. It returns the concatenated string of whichever members you slice. So, if we added prefixes in there you could get just the last name with a slice, ex:

Mostly what we need is just to have the prefixes in their own bucket, then we can be more flexible when we display things. Probably means changing the parse tree to add prefix in there.

The more I dive into the surname/last name/family name, the more confused I get... There doesn't seem to be a clear definition of any of these things. I am missing some entries in the prefixes, hence the reason I made a separate list so as not to interfere with your work. I was looking into possible locales but that seems to be a road I really don't want to go down...

I tried

There used to be similar conversations re: "title" and "suffix", because people got tripped up on the semantic meaning of a title vs a suffix, ex "Dr" and "MD" are kinda the same but they appear in different places in the name. I think the best strategy for the parser is to focus on the position of words (name parts) in the string, and on special words that join with words before/after/around them in certain conditions/parts of the name. In the future I should probably choose names that focus on that positional information and avoid semantic meaning. We can add some of those things from your affixes constants to prefixes. The only danger is ones that could also be first names, like "fitz", "mala" and "ned". I think the current handling for prefixes skips the first name though, so it could be fine to include those.

👎 That's annoying. Doesn't seem like that was my intention:

python-nameparser/nameparser/parser.py Lines 879 to 903 in 3efe171
|
||
]) | ||
|
||
def test_blank_name(self): | ||
hn = HumanName() | ||
self.m(hn.first, "", hn) | ||
|
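A minimal sketch of the slicing idea from the inline thread above, assuming each name part lives in an ordered bucket and a slice returns the selected buckets joined into one string, as the thread describes. `NameBuckets`, its `_members` order, and the separate `prefix` bucket are hypothetical names used only for illustration; this is not nameparser's actual implementation.

```python
class NameBuckets:
    """Hypothetical ordered container of name parts (not nameparser's API)."""

    # Order matters: slices are taken over this tuple of bucket names.
    _members = ("title", "first", "middle", "prefix", "last", "suffix")

    def __init__(self, **parts):
        # Any bucket that is not supplied defaults to an empty string.
        for member in self._members:
            setattr(self, member, parts.get(member, ""))

    def __getitem__(self, key):
        if isinstance(key, slice):
            # Join the non-empty buckets selected by the slice.
            selected = (getattr(self, m) for m in self._members[key])
            return " ".join(part for part in selected if part)
        # A plain string key looks up a single bucket by name.
        return getattr(self, key)


name = NameBuckets(first="Vincent", prefix="van der", last="Gogh")
print(name[3:5])  # prefix and last together: "van der Gogh"
print(name[4:5])  # last only: "Gogh"
```

With prefixes in their own bucket, `name[4:5]` gives just the last name, while `name[3:5]` keeps the prefixes attached for display.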
Many of these affixes are already defined in prefixes.py.
It seems like what you mainly want is to keep the parts that come before a last name, and that currently get added to it (whether we call them prefixes, affixes, or tussenvoegsels), separate from the last name.
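As a rough illustration of that separation, here is a position-based sketch that pulls the prefix words out of the last name while never treating the first word as a prefix. `split_family_prefixes` and `SAMPLE_PREFIXES` are hypothetical names, the prefix set is a tiny sample rather than the PR's `AFFIXES` list, and titles, middle names, and suffixes are deliberately ignored.

```python
# Hypothetical sample only; the PR's AFFIXES set is far larger.
SAMPLE_PREFIXES = {"van", "der", "de", "la", "von", "ned", "fitz"}


def split_family_prefixes(full_name, prefixes=SAMPLE_PREFIXES):
    """Split a simple "First [prefix ...] Last" string into three buckets.

    Returns (first, prefix_words, family_words). The first word is never
    treated as a prefix, so "Ned" or "Fitz" can still be a first name.
    """
    words = full_name.split()
    if not words:
        return "", [], []
    first, rest = words[0], words[1:]
    prefix_words = []
    # Collect leading prefix words, but always leave at least one word
    # for the family name itself.
    while len(rest) > 1 and rest[0].lower() in prefixes:
        prefix_words.append(rest.pop(0))
    return first, prefix_words, rest


print(split_family_prefixes("Vincent van der Gogh"))
print(split_family_prefixes("Ned Flanders"))
```

Because the split depends only on word position, "Ned Flanders" keeps "Ned" as the first name even though "ned" is in the sample prefix set, and a two-word name such as "vai la" keeps "la" as the last name.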