Get the second-level domain of a URL (Java)

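I'd like to know whether there is a parser or library in Java for extracting the second-level domain (SLD) from a URL, or, failing that, an algorithm or regular expression that does the same. For example: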

    URI uri = new URI("http://www.mydomain.ltd.uk/blah/some/page.html");
    String host = uri.getHost();
    System.out.println(host);

prints:

    www.mydomain.ltd.uk

What I'd like to do now is robustly identify the SLD ("ltd.uk") component. Any ideas?

Edit: Ideally I'm looking for a general solution, so I'd match ".uk" in "police.uk", ".co.uk" in "bbc.co.uk" and ".com" in "amazon.com".

Thanks

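I don't know your purpose, but the second-level domain may not mean much to you. You probably need to find the public suffix; the domain directly below it is what you are looking for.

Apache HttpComponents (HttpClient 4) comes with classes that handle this,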

    org.apache.http.impl.cookie.PublicSuffixFilter
    org.apache.http.impl.cookie.PublicSuffixListParser

You need to download the public suffix list from here,

http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1
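To make the idea concrete, here is a minimal sketch in plain Java of how a public suffix lookup leads to the domain registered directly below it. The class name `RegistrableDomainSketch` and the four hard-coded rules are illustrative assumptions, not the real list, and wildcard (`*.`) and exception (`!`) rules are deliberately left out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class RegistrableDomainSketch {
    // Hypothetical sample rules; a real program would load the full public suffix list.
    private static final Set<String> SAMPLE_RULES =
            new HashSet<>(Arrays.asList("com", "uk", "co.uk", "ltd.uk"));

    /** Returns the longest suffix of host found in SAMPLE_RULES, or null if none matches. */
    static String publicSuffix(String host) {
        String candidate = host.toLowerCase(Locale.ROOT);
        while (true) {
            // Candidates go from longest to shortest, so the first hit is the longest match.
            if (SAMPLE_RULES.contains(candidate)) return candidate;
            int dot = candidate.indexOf('.');
            if (dot == -1) return null;
            candidate = candidate.substring(dot + 1);
        }
    }

    /** Returns the public suffix plus the one label to its left, or null if there is none. */
    static String registrableDomain(String host) {
        String lower = host.toLowerCase(Locale.ROOT);
        String suffix = publicSuffix(lower);
        if (suffix == null || suffix.length() == lower.length()) return null;
        // Everything left of the suffix, without the separating dot.
        String prefix = lower.substring(0, lower.length() - suffix.length() - 1);
        return prefix.substring(prefix.lastIndexOf('.') + 1) + "." + suffix;
    }
}
```

With these sample rules, `registrableDomain("www.mydomain.ltd.uk")` gives `mydomain.ltd.uk` and `registrableDomain("police.uk")` gives `police.uk`. A host that is itself a suffix, such as `uk`, gives `null`.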

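After going over everything here, the correct solution should be (with Guava)

    InternetDomainName.from(uriHost).topPrivateDomain().toString();

Errors when using Guava to get the private domain name

After looking at these answers and not being satisfied, I used the com.google.common.net.InternetDomainName class to subtract the public parts of the domain name from all of its parts: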

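    import java.util.HashSet;
    import java.util.Set;
    import com.google.common.net.InternetDomainName;

    Set<String> nonePublicDomainParts(String uriHost) {
        InternetDomainName fullDomainName = InternetDomainName.from(uriHost);
        InternetDomainName publicDomainName = fullDomainName.publicSuffix();
        // Start with every label of the host, then remove the labels of the public suffix.
        Set<String> nonePublicParts = new HashSet<String>(fullDomainName.parts());
        nonePublicParts.removeAll(publicDomainName.parts());
        return nonePublicParts;
    }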

The class is available from Maven in the Guava library:

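    <dependency>
        <groupId>com.google.guava</groupId>
        <artifactId>guava</artifactId>
        <version>10.0.1</version>
        <scope>compile</scope>
    </dependency>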

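Internally, the class uses TldPatterns.class, which is package-private and contains the list of top-level domains.

Interestingly, if you look at the class source at the link below, it explicitly lists "police.uk" as a private domain name. That is correct, since police.uk is a private domain controlled by the police; otherwise criminals.police.uk could email you asking for your credit card details in connection with their ongoing card fraud investigation ;)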

http://code.google.com/p/guava-libraries/source/browse/guava/src/com/google/common/net/TldPatterns.java?spec=svn8c3cc7e67132f8dcaae4bd214736a8ddf6611769&r=8c3cc7e67132f8dcaae4bd214736a8ddf6611769

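The selected answer is the best approach. For those who don't want to code it themselves, here is how I did it.

First, either I don't understand org.apache.http.impl.cookie.PublicSuffixFilter, or it has a bug.

Basically, if you pass in google.com, it correctly returns false. If you pass in google.com.au, it incorrectly returns true. The bug is in the code that applies patterns, e.g. *.au.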
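As a small illustration of the wildcard behaviour described here, the following sketch checks whether a single rule covers a whole domain, with `*` standing for exactly one label. It is not the HttpClient code; the class and method names are made up for this example.

```java
class WildcardRuleSketch {
    /**
     * Returns true if the rule covers the entire domain label by label,
     * where "*" in the rule matches exactly one label of the domain.
     */
    static boolean isPublicSuffixUnderRule(String domain, String rule) {
        String[] domainLabels = domain.split("\\.");
        String[] ruleLabels = rule.split("\\.");
        // A rule with N labels can only describe a public suffix with N labels.
        if (domainLabels.length != ruleLabels.length) return false;
        for (int i = 0; i < ruleLabels.length; i++) {
            boolean wildcard = ruleLabels[i].equals("*");
            if (!wildcard && !ruleLabels[i].equalsIgnoreCase(domainLabels[i])) return false;
        }
        return true;
    }
}
```

Under the rule `*.au`, `com.au` counts as a public suffix but `google.com.au` does not, because it has one label too many. That is the result the filter should have produced.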

Here is the checker code, based on org.apache.http.impl.cookie.PublicSuffixFilter:

    import java.util.Collection;
    import java.util.HashSet;
    import java.util.Set;

    public class TopLevelDomainChecker {
        private Set<String> exceptions;
        private Set<String> suffixes;

        public void setPublicSuffixes(Collection<String> suffixes) {
            this.suffixes = new HashSet<String>(suffixes);
        }

        public void setExceptions(Collection<String> exceptions) {
            this.exceptions = new HashSet<String>(exceptions);
        }

        /**
         * Checks if the domain is a TLD.
         * @param domain the domain to check; a leading dot is ignored
         * @return true if the domain is in the suffix list or matches a wildcard rule
         */
        public boolean isTLD(String domain) {
            if (domain.startsWith("."))
                domain = domain.substring(1);

            // An exception rule takes priority over any other matching rule.
            // Exceptions are ones that are not a TLD, but would match a pattern rule
            // eg bl.uk is not a TLD, but the rule *.uk means it is. Hence there is an exception rule
            // stating that bl.uk is not a TLD.
            if (this.exceptions != null && this.exceptions.contains(domain))
                return false;

            if (this.suffixes == null)
                return false;

            if (this.suffixes.contains(domain))
                return true;

            // Try patterns. ie *.jp means that boo.jp is a TLD
            int nextdot = domain.indexOf('.');
            if (nextdot == -1)
                return false;
            domain = "*" + domain.substring(nextdot);
            if (this.suffixes.contains(domain))
                return true;

            return false;
        }

        public String extractSLD(String domain) {
            String last = domain;
            boolean anySLD = false;
            do {
                if (isTLD(domain)) {
                    // Return the previous, one-label-longer domain, if there was one.
                    if (anySLD)
                        return last;
                    else
                        return "";
                }
                anySLD = true;
                last = domain;
                int nextDot = domain.indexOf(".");
                if (nextDot == -1)
                    return "";
                domain = domain.substring(nextDot + 1);
            } while (domain.length() > 0);
            return "";
        }
    }

And the parser. I renamed it.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.Reader;
    import java.util.ArrayList;
    import java.util.Collection;

    /**
     * Parses the list from publicsuffix.org
     * Copied from http://svn.apache.org/repos/asf/httpcomponents/httpclient/trunk/httpclient/src/main/java/org/apache/http/impl/cookie/PublicSuffixListParser.java
     */
    public class TopLevelDomainParser {
        private static final int MAX_LINE_LEN = 256;
        private final TopLevelDomainChecker filter;

        TopLevelDomainParser(TopLevelDomainChecker filter) {
            this.filter = filter;
        }

        public void parse(Reader list) throws IOException {
            Collection<String> rules = new ArrayList<String>();
            Collection<String> exceptions = new ArrayList<String>();
            BufferedReader r = new BufferedReader(list);
            StringBuilder sb = new StringBuilder(256);
            boolean more = true;
            while (more) {
                more = readLine(r, sb);
                String line = sb.toString();
                if (line.length() == 0) continue;
                if (line.startsWith("//")) continue; // entire lines can also be commented using //
                if (line.startsWith(".")) line = line.substring(1); // A leading dot is optional
                // An exclamation mark (!) at the start of a rule marks an exception to a previous wildcard rule
                boolean isException = line.startsWith("!");
                if (isException) line = line.substring(1);

                if (isException) {
                    exceptions.add(line);
                } else {
                    rules.add(line);
                }
            }

            filter.setPublicSuffixes(rules);
            filter.setExceptions(exceptions);
        }

        private boolean readLine(Reader r, StringBuilder sb) throws IOException {
            sb.setLength(0);
            int b;
            boolean hitWhitespace = false;
            while ((b = r.read()) != -1) {
                char c = (char) b;
                if (c == '\n') break;
                // Each line is only read up to the first whitespace
                if (Character.isWhitespace(c)) hitWhitespace = true;
                if (!hitWhitespace) sb.append(c);
                if (sb.length() > MAX_LINE_LEN) throw new IOException("Line too long"); // prevent excess memory usage
            }
            return (b != -1);
        }
    }

Finally, how to use it

    FileReader fr = new FileReader("effective_tld_names.dat.txt");
    TopLevelDomainChecker checker = new TopLevelDomainChecker();
    TopLevelDomainParser parser = new TopLevelDomainParser(checker);
    parser.parse(fr);

    // Results as reported in this answer; they depend on the contents of the list file.
    boolean result;
    result = checker.isTLD("com");            // true
    result = checker.isTLD("com.au");         // true
    result = checker.isTLD("ltd.uk");         // true
    result = checker.isTLD("google.com");     // false
    result = checker.isTLD("google.com.au");  // false
    result = checker.isTLD("metro.tokyo.jp"); // false

    String sld;
    sld = checker.extractSLD("com");                     // ""
    sld = checker.extractSLD("com.au");                  // ""
    sld = checker.extractSLD("google.com");              // "google.com"
    sld = checker.extractSLD("google.com.au");           // "google.com.au"
    sld = checker.extractSLD("www.google.com.au");       // "google.com.au"
    sld = checker.extractSLD("www.google.com");          // "google.com"
    sld = checker.extractSLD("foo.bar.hokkaido.jp");     // "foo.bar.hokkaido.jp"
    sld = checker.extractSLD("moo.foo.bar.hokkaido.jp"); // "foo.bar.hokkaido.jp"

  1. The list mentioned above, plus reading up on Wikipedia updates, gives you a TLD list that is about 98% correct.
  2. Going through http://www.iana.org/domains/root/db/ yourself, clicking on each page and checking the latest news, gives you the other 2% (such as .com.aq and .gov.an).
  3. Unfortunately, the big "free web space" providers are another factor to consider, e.g. the countless *.blogspot.com domains. If you download the Alexa top 100,000 (a free CSV file), you at least get a good idea of the most commonly used ones, which should cover a certain percentage of these domains. This matters, for example, when comparing Alexa rankings with StumbleUpon scores and Delicious bookmarks: Alexa sometimes only counts the top domain, while Delicious really takes an MD5 hash of every URL, so one Alexa entry maps to multiple Delicious MD5 hashes.
  4. If you are trying to judge the uniqueness of something, what comes after the / sometimes matters too, as in the case of Twitter.

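To give you a feel for it, here is what remains of the Alexa top 40,000 once the real TLDs are filtered out (that is, for the following entries Alexa does not count the ranking against the bare domain):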

bp.blogspot.com, espn.go.com, files.wordpress.com, abcnews.go.com, disney.go.com, troktiko.blogspot.com, en.wordpress.com, api.ning.com, abc.go.com, 220.181.38.82, 213.174.154.20, abclocal.go.com, feedproxy.google.com/~r, forums.wordpress.com, googleblog.blogspot.com, 1.cnm999.com/user/10008, 213.174.143.196, 92.42.51.201, googlewebmastercentral.blogspot.com, myespn.go.com, 213.174.143.197, 61.132.221.146, support.wordpress.com, dashboard.wordpress.com, sethgodin.typepad.com, paygo.17zhifu.com/user/10005, go2.wordpress.com, 1.1.1.1, movies.go.com, home.comcast.net, googlesystem.blogspot.com, abcfamily.go.com, home.spaces.live.com, 196.1.237.210, kaixin001.com/~record, xhamster.com/user/video, gold-oil-commodity.blogspot.com, journeyplanner.tfl.gov.uk/user/XSLT_TRIP_REQUEST2, 206.108.48.238, blog.wordpress.com, 67.220.92.21, 183.101.80.130, 211.94.190.80, youtube-global.blogspot.com, uta-net.com/user/phplib, cinema3satu.blogspot.com, 119.147.41.16, sites.google.com/site/sites, kk.iij4u.or.jp/~dyo, 220.181.6.19, toontown.go.com, signup.wordpress.com, thesartorialist.blogspot.com, analytics.blogspot.com, ss.iij4u.or.jp/~ceh2, 67.220.92.23, gmailblog.blogspot.com, 183.99.121.86, vgorode.ru/user/create, 61.132.216.243, 217.175.53.72, labnol.blogspot.com, adsense.blogspot.com, subscribe.wordpress.com, fimotro.blogspot.com, creators.ning.com, sarkari-naukri.blogspot.com, search.wordpress.com, orange-hiyoko.blogspot.com, cashewmaniakpop.wordpress.com, pixiehollow.go.com, adwords.blogspot.com, 202.53.226.102, lorelle.wordpress.com, homestead.com/~site, multiply.com/user/signout, 221.231.148.249, 183.101.80.77, windowsliveintro.spaces.live.com, 124.228.254.234, streaming-web.blogspot.com, id.tianya.cn/user/message, familyfun.go.com, tro-ma-ktiko.blogspot.com, about.ning.com, paygo.17zhifu.com/user/10020, tututina.blogspot.com, toolserver.org/~geohack, superjob.ru/user/resume, ejobs.ro/user/locuri-de-munca, gnula.blogspot.com, alles.or.jp/~uir, chiark.greenend.org.uk/~sgtatham, woork.blogspot.com,
88.208.32.218, webstreamingmania.blogspot.com, spaces.live.com, youtube.com/user/RayWilliamJohnson, cloob.com/user/login, asstr.org/~Kristen, getclicky.com/user/login, guesshermuff.blogspot.com, 211.98.70.195, 222.73.105.196, pp.iij4u.or.jp/~taakii, unsoloclic.blogspot.com, photoshopdisasters.blogspot.com, 218.83.161.253, 217.16.18.163, 217.16.18.207, 217.16.28.104, 222.73.105.210, youtube.com/user/OldSpice, hubpages.com/user/new, pelisdvdripdd.blogspot.com, 95.143.193.60, es.wordpress.com, 217.16.18.206, 61.147.116.146, damncoolpics.blogspot.com, family.go.com, 81.176.235.162, gutteruncensorednewsr.blogspot.com, terselubung.blogspot.com, faisalardhy.blogspot.com, 67.220.92.14, goodreads.com/user/show, 116.228.55.34, profile.typepad.com, kaixin001.com/~truth, linkbuildersassociated.ning.com, nicotto.jp/user/mypage, ritemail.blogspot.com, hyperboleandahalf.blogspot.com, carscoop.blogspot.com, tubemogul.com/user/dash, press-gr.blogspot.com, 81.176.235.164, soapnet.go.com, 208.98.30.69, trelokouneli.blogspot.com, help.ning.com, id.tianya.cn/user/register, slovari.yandex.ru/~%D0%BA%D0%BD%D0%B8%D0%B3%D0%B8, printable-coupons.blogspot.com, unic77.blogspot.com, globaleconomicanalysis.blogspot.com, 183.101.80.68, 221.194.33.60, doujin-games88.blogspot.com, magaseek.com/user/SearchProducts, files.posterous.com, wwwnew.splinder.com, kolom-tutorial.blogspot.com, strobist.blogspot.com, 67.21.91.73, needanarticle.com/user/activity, forum.moe.gov.om/~moeoman, milasdaydreams.blogspot.com, 88.208.17.189, 67.220.92.22, 115.238.100.211, nonews-news.blogspot.com, testosterona.blog.br, nn.iij4u.or.jp/~has, cs.tut.fi/~jkorpela, youtube.com/user/oldspice, 67.159.53.25, taxalia.blogspot.com, 208.98.30.70, filmesporno.blog.br, alles-schallundrauch.blogspot.com, vatera.hu/user/account, 78.140.136.182, us.my.alibaba.com/user/join, stores.homestead.com, pes2008editing.blogspot.com, ocn.ne.jp/~matrix, adweek.blogs.com, 115.238.55.94, markjaquith.wordpress.com, k3.dion.ne.jp/~dreamlov, 38.99.186.222, film.tv.it, android-developers.blogspot.com,
217.218.110.147, kadokado.com/user/login, bollyvideolinks4u.blogspot.com, sookyeong.wordpress.com, 87.101.230.11, livecodes.blogspot.com, 67.220.91.19, homepage2.nifty.com/bustered, pp.iij4u.or.jp/~manga100, 110.173.49.202, erogamescape.dyndns.org/~ap2, cs.berkeley.edu/~[user name garbled in source], cakewrecks.blogspot.com, 59.106.117.185, 119.75.213.61, id.wordpress.com, de.wordpress.com, telefilmdblink.blogspot.com, 61.139.105.138, multiply.com/user/join, programseo.blogspot.com, collectivebias.ning.com, bablorub.blogspot.com, thinkexist.com/user/personalAccount, us.my.alibaba.com/user/sign, 66.70.56.90, getsarkari-naukri.blogspot.com, 59.106.117.183, productreviewplace.ning.com, support.weebly.com, kaixin001.com/~lucky, football-russia.blogspot.com, magaseek.com/user/ItemDetail, polprav.blogspot.com, atlasshrugs2000.typepad.com, jpn-manga.blogspot.com, 88.208.32.219, google-latlong.blogspot.com, 59.106.117.188, erogamescape.ddo.jp/~ap2, 218.87.32.245, watchhorrormovies.blogspot.com, sarotiko.blogspot.com, googlewebmastercentral-de.blogspot.com, colmeia.blog.br, us.my.alibaba.com/user/webatm, 220.170.79.109, darkville.blogspot.com, youtube.com/user/PiMPDailyDose, disneymovierewards.go.com, fukuoka.lg.jp, 61.147.115.16, iisc.ernet.in, youtube.com/user/HuskyStarcraft, 202.108.212.211, homepage3.nifty.com/otakarando, 94.77.215.37, pitchit.ning.com, 59.106.117.186, thestar.blogs.com, 1.254.254.254, piratesonline.go.com, animedblink.blogspot.com, 137.32.44.152, eurus.dti.ne.jp/~yoneyama, state.la.us, lastminute.is.it, bangpai.taobao.com/user/groups, csse.monash.edu.au/~jwb, jquery-howto.blogspot.com, sakura.ne.jp/~moesino, users.skynet.be/mgueury, saitama.lg.jp, portaldasfinancas.gov.pt, bnonline.fi.cr, 135.125.60.11, zhuhai.gd.cn, kuna.net.kw, 59.175.213.77, 58.218.199.7, multiply.com/user/signin, youtube.com/user/HDstarcraft, blinklist.com/user/join, us.my.alibaba.com/user/company, jptwitterhelp.blogspot.com,
67.220.92.017, 88.208.17.51, youtube.com/user/GoogleWebmasterHelp, 208.53.156.229, filmdblink.blogspot.com, blinklist.com/user/signup, 3arbtop.blogspot.com, attivissimo.blogspot.com, onlinemovie12.blogspot.com, 98.126.189.86, mytvsource.blogspot.com, blinklist.com/user/login, googlejapan.blogspot.com, 76.73.65.166, gutteruncensorednewsb.blogspot.com, issuu.com/user/upload, 86.51.174.18, 88.208.17.120, profile.china.alibaba.com/user/admin, jntuworldportal.blogspot.com, sz.js.cn, disneymovieclub.go.com, a1.com.mk, dd.iij4u.or.jp/~madonna, rr.iij4u.or.jp/~plasma, mlmlaunchformula.ning.com, 112.78.7.151, blogdelatele.blogspot.com, googlemobile.blogspot.com, 78.109.199.240, wsu.edu/~brians, internapoli-city.blogspot.com, hh.iij4u.or.jp/~dmt, kaixin001.com/~house, 61.155.11.14, youtube.com/user/SHAYTARDS, turbobit.net/user/files, qjy168.com/user/do, hubpages.com/user/finished, upload2.dyndns.org, f32.aaa.livedoor.jp/~azusa, naruto-spoilers.blogspot.com, 205.209.140.195, 193.227.20.21, adsenseforfeeds.blogspot.com, group.ameba.jp/user/groups

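I don't have an answer for your specific case, and Jonathan's comment points out that you should probably rephrase your question.

Still, I suggest taking a look at the Reference class of the Restlet project. It has a lot of useful methods. Since Restlet is open source, you don't have to use the whole library; you can download the source and add just that class to your project.

This is what you want: publicSuffix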

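1. The nonePublicDomainParts method contributed by simbo1905 should be corrected, because it fails for TLDs that contain a ".", for example "com.ac"

input: "com.abc.com.ac"

output: "abc"

The correct output is "com.abc"

To get the SLD, you can use the publicSuffix() method to strip the TLD from the given domain.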

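2. A Set should not be used, because a domain can contain the same part more than once, for example:

input: part1.part2.part1.TLD

output: part1, part2

The correct output is: part1, part2, part1, or in the form part1.part2.part1

So use a List instead of a Set.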
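The two corrections above can be sketched in plain Java as follows. To stay self-contained, this example takes the public suffix as a string argument rather than calling Guava; in real code it would presumably come from `InternetDomainName.publicSuffix()`. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PrivatePartsSketch {
    /**
     * Drops the trailing public-suffix labels from the host's labels.
     * A List keeps both the order of the labels and any repeated labels.
     */
    static List<String> privateParts(String host, String publicSuffix) {
        List<String> hostParts = Arrays.asList(host.split("\\."));
        List<String> suffixParts = Arrays.asList(publicSuffix.split("\\."));
        int keep = hostParts.size() - suffixParts.size();
        // Only strip the suffix if it really is the tail of the host.
        if (keep < 0 || !hostParts.subList(keep, hostParts.size()).equals(suffixParts)) {
            throw new IllegalArgumentException(publicSuffix + " is not a suffix of " + host);
        }
        return new ArrayList<>(hostParts.subList(0, keep));
    }
}
```

For example, `String.join(".", privateParts("com.abc.com.ac", "com.ac"))` gives `com.abc`, and `privateParts("part1.part2.part1.tld", "tld")` keeps all three parts.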

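    import java.util.Iterator;
    import com.google.common.net.InternetDomainName;

    // Despite its name, the uri parameter must be a host name such as
    // "www.google.com", not a full URL.
    public static String getTopLevelDomain(String uri) {
        InternetDomainName fullDomainName = InternetDomainName.from(uri);
        InternetDomainName publicDomainName = fullDomainName.topPrivateDomain();
        String topDomain = "";
        // Join the labels of the top private domain with dots.
        Iterator<String> it = publicDomainName.parts().iterator();
        while (it.hasNext()) {
            String part = it.next();
            if (!topDomain.isEmpty()) topDomain += ".";
            topDomain += part;
        }
        return topDomain;
    }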

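Just pass in the domain name and you will get the top-level domain. Download the jar file from http://code.google.com/p/guava-libraries/

Dnspy is another, more flexible alternative to the publicsuffix lib.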

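If you want the second-level domain, you can split the string on "." and take the last two parts. Of course, this assumes that you always have a second-level domain that is not specific to the site (since that sounds like what you want).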
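A minimal sketch of that split-and-take-the-last-two approach, with a made-up class name, might look like this. It also shows where the assumption breaks down: for a host under a multi-label suffix such as "co.uk", the last two parts are the suffix itself.

```java
class NaiveLastTwoLabels {
    /** Returns the last two dot-separated labels of host, or host itself if it has fewer. */
    static String lastTwoLabels(String host) {
        String[] parts = host.split("\\.");
        if (parts.length < 2) return host;
        return parts[parts.length - 2] + "." + parts[parts.length - 1];
    }
}
```

`lastTwoLabels("www.google.com")` gives `google.com`, but `lastTwoLabels("www.bbc.co.uk")` gives `co.uk`, which is why the other answers rely on the public suffix list.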