Java name parsing library?

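I am looking for a library similar to the Perl Lingua::EN::NameParse module. Essentially, I want to parse a string like 'Mr. Bob R. Smith' into prefix, first name, last name, and name suffix components. Google has not been much help in finding something like this, and I would rather not roll my own if possible. Does anyone know of an OSS Java library that can do this in a sophisticated way?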

I just could not believe that nobody had shared a library for this. I looked on GitHub and found a JavaScript name parser that could easily be converted to Java: https://github.com/joshfraser/JavaScript-Name-Parser

I also modified the code from one of the other answers so that it works better, and included a test case:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.commons.lang.StringUtils;

    public class NameParser {

        private String firstName = "";
        private String lastName = "";
        private String middleName = "";
        private List<String> middleNames = new ArrayList<String>();
        private List<String> titlesBefore = new ArrayList<String>();
        private List<String> titlesAfter = new ArrayList<String>();
        private final String[] prefixes = { "dr", "mr", "ms", "atty", "prof", "miss", "mrs" };
        private final String[] suffixes = { "jr", "sr", "ii", "iii", "iv", "v", "vi", "esq",
                "2nd", "3rd", "jd", "phd", "md", "cpa" };

        public NameParser() {
        }

        public NameParser(String name) {
            parse(name);
        }

        private void reset() {
            firstName = lastName = middleName = "";
            middleNames = new ArrayList<String>();
            titlesBefore = new ArrayList<String>();
            titlesAfter = new ArrayList<String>();
        }

        // Case-insensitive exact match after dropping '.' and ','. A prefix test
        // (startsWith) would mistake a surname such as "Vickers" for the suffix "v".
        private boolean isOneOf(String checkStr, String[] titles) {
            String normalized = checkStr.toLowerCase().replace(".", "").replace(",", "");
            for (String title : titles) {
                if (normalized.equals(title))
                    return true;
            }
            return false;
        }

        public void parse(String name) {
            if (StringUtils.isBlank(name))
                return;
            this.reset();
            String[] words = name.split(" ");
            boolean isFirstName = false;

            // First pass: leading titles, then the first name; everything else
            // is provisionally collected as a middle name.
            for (String word : words) {
                if (StringUtils.isBlank(word))
                    continue;
                if (word.charAt(word.length() - 1) == '.') {
                    if (!isFirstName && !this.isOneOf(word, prefixes)) {
                        firstName = word;
                        isFirstName = true;
                    } else if (isFirstName) {
                        middleNames.add(word);
                    } else {
                        titlesBefore.add(word);
                    }
                } else {
                    if (word.endsWith(","))
                        word = StringUtils.chop(word);
                    if (!isFirstName) {
                        firstName = word;
                        isFirstName = true;
                    } else {
                        middleNames.add(word);
                    }
                }
            }

            // Second pass: walk backwards, peeling off suffixes until the last name is found.
            if (middleNames.size() > 0) {
                boolean stop = false;
                List<String> toRemove = new ArrayList<String>();
                for (int i = middleNames.size() - 1; i >= 0 && !stop; i--) {
                    String str = middleNames.get(i);
                    if (this.isOneOf(str, suffixes)) {
                        titlesAfter.add(str);
                    } else {
                        lastName = str;
                        stop = true;
                    }
                    toRemove.add(str);
                }
                // Every trailing word looked like a suffix: treat the earliest one as the last name.
                if (StringUtils.isBlank(lastName) && titlesAfter.size() > 0) {
                    lastName = titlesAfter.get(titlesAfter.size() - 1);
                    titlesAfter.remove(titlesAfter.size() - 1);
                }
                for (String s : toRemove) {
                    middleNames.remove(s);
                }
            }
        }

        public String getFirstName() {
            return firstName;
        }

        public String getLastName() {
            return lastName;
        }

        public String getMiddleName() {
            if (StringUtils.isBlank(this.middleName)) {
                for (String name : middleNames) {
                    middleName += (name + " ");
                }
                middleName = StringUtils.chop(middleName);
            }
            return middleName;
        }

        public List<String> getTitlesBefore() {
            return titlesBefore;
        }

        public List<String> getTitlesAfter() {
            return titlesAfter;
        }
    }

Test case:

    import org.junit.Assert;
    import org.junit.Test;

    public class NameParserTest {

        private static class TestData {
            String name;
            String firstName;
            String lastName;
            String middleName;

            public TestData(String name, String firstName, String middleName, String lastName) {
                this.name = name;
                this.firstName = firstName;
                this.lastName = lastName;
                this.middleName = middleName;
            }
        }

        @Test
        public void test() {
            // Arguments: full name, expected first, expected middle, expected last
            TestData[] td = {
                new TestData("Henry \"Hank\" J. Fasthoff IV", "Henry", "\"Hank\" J.", "Fasthoff"),
                new TestData("April A. (Caminez) Bentley", "April", "A. (Caminez)", "Bentley"),
                new TestData("fff lll", "fff", "", "lll"),
                new TestData("fff mmmmm lll", "fff", "mmmmm", "lll"),
                new TestData("fff mmm1 mm2 lll", "fff", "mmm1 mm2", "lll"),
                new TestData("Mr. Dr. Tom Jones", "Tom", "", "Jones"),
                new TestData("Robert P. Bethea Jr.", "Robert", "P.", "Bethea"),
                new TestData("Charles P. Adams, Jr.", "Charles", "P.", "Adams"),
                new TestData("B. Herbert Boatner, Jr.", "B.", "Herbert", "Boatner"),
                new TestData("Bernard H. Booth IV", "Bernard", "H.", "Booth"),
                new TestData("F. Laurens \"Larry\" Brock", "F.", "Laurens \"Larry\"", "Brock"),
                new TestData("Chris A. D'Amour", "Chris", "A.", "D'Amour")
            };

            NameParser bp = new NameParser();
            for (int i = 0; i < td.length; i++) {
                bp.parse(td[i].name);
                Assert.assertEquals(td[i].firstName, bp.getFirstName());
                Assert.assertEquals(td[i].lastName, bp.getLastName());
                Assert.assertEquals(td[i].middleName, bp.getMiddleName());
            }
        }
    }

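Maybe you could try the GATE named entity extraction component? It comes with JAPE grammars and gazetteer lists for extracting first names, last names, and so on. See this page.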

Apache Commons has a HumanNameParser class (in the Commons Text sandbox):

https://commons.apache.org/sandbox/commons-text/jacoco/org.apache.commons.text.names/HumanNameParser.java.html

    // parser is a HumanNameParser instance
    Name nextName = parser.parse("James C. ('Jimmy') O'Dell, Jr.");
    String firstName = nextName.getFirstName();
    String nickname = nextName.getNickName();

This simple code may help:

    import java.util.ArrayList;
    import java.util.List;

    public class NamesConverter {

        private List<String> titlesBefore = new ArrayList<>();
        private List<String> titlesAfter = new ArrayList<>();
        private String firstName = "";
        private String lastName = "";
        private List<String> middleNames = new ArrayList<>();

        public NamesConverter(String name) {
            String[] words = name.trim().split("\\s+");
            boolean isTitleAfter = false;
            boolean isFirstName = false;

            for (String word : words) {
                if (word.isEmpty()) {
                    continue; // blank input
                }
                // Any word ending in '.' is treated as a title.
                if (word.charAt(word.length() - 1) == '.') {
                    if (isTitleAfter) {
                        titlesAfter.add(word);
                    } else {
                        titlesBefore.add(word);
                    }
                } else {
                    isTitleAfter = true;
                    if (!isFirstName) {
                        firstName = word;
                        isFirstName = true;
                    } else {
                        middleNames.add(word);
                    }
                }
            }
            // The last collected word is the last name. Remove it by index so that
            // an identical middle name earlier in the list is left in place.
            if (middleNames.size() > 0) {
                lastName = middleNames.remove(middleNames.size() - 1);
            }
        }

        public List<String> getTitlesBefore() {
            return titlesBefore;
        }

        public List<String> getTitlesAfter() {
            return titlesAfter;
        }

        public String getFirstName() {
            return firstName;
        }

        public String getLastName() {
            return lastName;
        }

        public List<String> getMiddleNames() {
            return middleNames;
        }

        @Override
        public String toString() {
            String text = "Titles before :" + titlesBefore.toString() + "\n"
                    + "First name :" + firstName + "\n"
                    + "Middle names :" + middleNames.toString() + "\n"
                    + "Last name :" + lastName + "\n"
                    + "Titles after :" + titlesAfter.toString() + "\n";
            return text;
        }
    }

For example, this input:

    NamesConverter ns = new NamesConverter("Mr. Dr. Tom Jones");
    NamesConverter ns1 = new NamesConverter("Ing. Tom Ridley Bridley Furthly Murthly Jones CsC.");
    System.out.println(ns);
    System.out.println(ns1);

produces this output:

    Titles before :[Mr., Dr.]
    First name :Tom
    Middle names :[]
    Last name :Jones
    Titles after :[]

    Titles before :[Ing.]
    First name :Tom
    Middle names :[Ridley, Bridley, Furthly, Murthly]
    Last name :Jones
    Titles after :[CsC.]

Personally, I would go with regular expressions. Here is a good introduction. They are fast, concise, and can do what you want.
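To make the regular expression suggestion concrete, here is a minimal sketch using `java.util.regex`. The class name `RegexNameSketch`, the `Parsed` holder, and the short prefix and suffix lists are all hypothetical choices for illustration. It assumes Western-style names with at least a first and a last name, and it will not cope with multi-word surnames or nicknames.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not a complete name parser.
final class RegexNameSketch {

    /** Value holder; prefix, middle and suffix are null when absent. */
    static final class Parsed {
        final String prefix, first, middle, last, suffix;

        Parsed(String prefix, String first, String middle, String last, String suffix) {
            this.prefix = prefix;
            this.first = first;
            this.middle = middle;
            this.last = last;
            this.suffix = suffix;
        }
    }

    // The prefix/suffix alternatives are illustrative assumptions; extend as needed.
    private static final Pattern NAME = Pattern.compile(
            "\\s*(?:(?<prefix>(?:mr|mrs|ms|miss|dr|prof)\\.?)\\s+)?"   // optional title
            + "(?<first>\\S+)"                                         // first name
            + "(?:\\s+(?<middle>.+?))??"                               // optional middle part, tried last
            + "\\s+(?<last>[^\\s,]+)"                                  // last name (no comma)
            + "(?:,?\\s+(?<suffix>(?:jr|sr|ii|iii|iv|esq)\\.?))?\\s*", // optional suffix
            Pattern.CASE_INSENSITIVE);

    /** Returns the parsed components, or empty if the input does not fit the pattern. */
    static Optional<Parsed> parse(String name) {
        Matcher m = NAME.matcher(name);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new Parsed(m.group("prefix"), m.group("first"),
                m.group("middle"), m.group("last"), m.group("suffix")));
    }

    public static void main(String[] args) {
        parse("Mr. Bob R. Smith Jr.").ifPresent(p -> System.out.println(
                p.prefix + " | " + p.first + " | " + p.middle + " | " + p.last + " | " + p.suffix));
        // prints: Mr. | Bob | R. | Smith | Jr.
    }
}
```

The middle group is made reluctant (`??`) so that in "Bob Smith Jr." the engine first tries "Smith" as the last name and "Jr." as the suffix, rather than taking "Smith" as a middle name. A single-word input does not match and yields an empty result.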

If you want to stay within the bounds of the Java SDK, use string tokenizers.
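As a rough illustration of the tokenizer approach, the sketch below uses `java.util.StringTokenizer` to split a name on spaces, tabs, and commas. The class name `TokenizerNameSketch` and the `tokens` helper are hypothetical. Tokenizing is only the first step: deciding which token is a title, first name, or suffix still needs logic like that in the answers above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical sketch: tokenization only, no classification of the tokens.
final class TokenizerNameSketch {

    /** Splits a name on spaces, tabs and commas; empty tokens are never produced. */
    static List<String> tokens(String name) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(name, " \t,");
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("Charles P. Adams, Jr."));
        // prints: [Charles, P., Adams, Jr.]
    }
}
```

Unlike `String.split(" ")`, the tokenizer skips runs of delimiters, so double spaces or a comma followed by a space do not yield empty strings.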

At a lower level there is JavaCC, a Java-based parser generator. Here is a link to a tutorial.

Another alternative to JavaCC is ANTLR, which I have personally had good experiences with.