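How to get the website name from any URL string

I am given a String that contains any valid URL. I have to find the name of the website from the given URL. I also need to ignore subdomains.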

For example:

    http://www.yahoo.com => yahoo
    www.google.co.in => google
    http://in.com => in
    http://india.gov.in/ => india
    https://in.yahoo.com/ => yahoo
    http://philotheoristic.tumblr.com/ => tumblr
    https://in.movies.yahoo.com/ => yahoo

How can this be done?

You can use the URL class.

From the documentation: http://docs.oracle.com/javase/tutorial/networking/urls/urlInfo.html

    import java.net.*;
    import java.io.*;

    public class ParseURL {
        public static void main(String[] args) throws MalformedURLException {
            URL aURL = new URL("http://example.com:80/docs/books/tutorial"
                               + "/index.html?name=networking#DOWNLOADING");

            System.out.println("protocol = " + aURL.getProtocol());
            System.out.println("authority = " + aURL.getAuthority());
            System.out.println("host = " + aURL.getHost());
            System.out.println("port = " + aURL.getPort());
            System.out.println("path = " + aURL.getPath());
            System.out.println("query = " + aURL.getQuery());
            System.out.println("filename = " + aURL.getFile());
            System.out.println("ref = " + aURL.getRef());
        }
    }

Here is the output displayed by the program:

    protocol = http
    authority = example.com:80
    host = example.com          // name of website
    port = 80
    path = /docs/books/tutorial/index.html
    query = name=networking
    filename = /docs/books/tutorial/index.html?name=networking
    ref = DOWNLOADING

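So aURL.getHost() gives you the host name. To ignore subdomains, split the host on ".". Note that String.split takes a regular expression, so the dot has to be escaped: aURL.getHost().split("\\."). The first element, [0], is the name only when the host has no subdomain (example.com gives example). For a host like www.yahoo.com, [0] is www, so you have to pick the label that sits just before the domain suffix instead.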
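
To make the splitting step concrete, here is a minimal sketch. The class and method names (HostLabels, labels) are made up for illustration, and it uses java.net.URI instead of URL, which is my substitution, not part of the tutorial code. It also assumes that an input without "://" is a bare host such as www.google.co.in and prepends a scheme so the host can be parsed.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;

final class HostLabels {
    // Returns the dot-separated labels of the URL's host, left to right.
    static List<String> labels(String url) {
        // Assumption: input without "://" is a bare host, so prepend a scheme
        // to let URI recognise the host part.
        String withScheme = url.contains("://") ? url : "http://" + url;
        String host = URI.create(withScheme).getHost();
        if (host == null) {
            throw new IllegalArgumentException("no host found in: " + url);
        }
        // split() takes a regex, so the dot must be escaped.
        return Arrays.asList(host.split("\\."));
    }

    public static void main(String[] args) {
        System.out.println(labels("http://www.yahoo.com"));         // [www, yahoo, com]
        System.out.println(labels("www.google.co.in"));             // [www, google, co, in]
        System.out.println(labels("https://in.movies.yahoo.com/")); // [in, movies, yahoo, com]
    }
}
```

As the output shows, the site name is not at a fixed index: it depends on how many labels belong to the suffix (com versus co.in) and how many subdomains precede it.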

A regular expression can help you:

    String str = "www.google.co.in";
    String[] res = str.split("(\\.|//)+(?=\\w)");
    System.out.println(res[1]);

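A regular expression is a way of representing a set of strings. The set consists of every string that matches the expression. In the code above, the string passed to split is a regular expression that matches any "." followed by a word character, or "//" followed by a word character. These "." and "//" substrings are the separators used to split the string into parts, and the part after the first separator is taken as the site name.

In "www.google.co.in" the string is split into: www, google, co, in. Since the solution takes the second element of the split array (index 1), the result is: google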
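
To see how that separator regex behaves on other inputs from the question, here is a small sketch. The class name SplitDemo and the helper parts are hypothetical; the regex itself is copied from the snippet above.

```java
import java.util.Arrays;

final class SplitDemo {
    // Splits the input with the separator regex from the answer.
    static String[] parts(String s) {
        return s.split("(\\.|//)+(?=\\w)");
    }

    public static void main(String[] args) {
        String[] inputs = {"www.google.co.in", "http://in.com", "http://www.yahoo.com"};
        for (String s : inputs) {
            String[] p = parts(s);
            System.out.println(s + " -> " + Arrays.toString(p) + ", index 1 = " + p[1]);
        }
        // www.google.co.in -> [www, google, co, in], index 1 = google
        // http://in.com -> [http:, in, com], index 1 = in
        // http://www.yahoo.com -> [http:, www, yahoo, com], index 1 = www
    }
}
```

So index 1 is the site name only when exactly one piece comes before it: either a scheme or a single subdomain, but not both.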

I found something similar, though slightly different: it returns the title of the page rather than the domain name.

    http://www.yahoo.com => Yahoo
    http://www.google.co.in => Google
    http://in.com => In.com Offers Videos, News, Photos, Celebs, Live TV Channels.....
    http://india.gov.in/ => National Portal of India
    https://in.yahoo.com/ => Yahoo India
    http://philotheoristic.tumblr.com/ => Philotheoristic
    https://in.movies.yahoo.com/ => Yahoo India Movies - Bollywood News, Movie Reviews & Hindi Movie Videos

Here is the code:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLConnection;
    import java.nio.charset.Charset;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class TitleExtractor {
        /* the CASE_INSENSITIVE flag accounts for
         * sites that use uppercase title tags.
         * the DOTALL flag accounts for sites that have
         * line feeds in the title text */
        private static final Pattern TITLE_TAG =
            Pattern.compile("\\<title>(.*)\\</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

        /**
         * @param url the HTML page
         * @return title text (null if document isn't HTML or lacks a title tag)
         * @throws IOException
         */
        public static String getPageTitle(String url) throws IOException {
            URL u = new URL(url);
            URLConnection conn = u.openConnection();

            // ContentType is an inner class defined below
            ContentType contentType = getContentTypeHeader(conn);
            // a missing Content-Type header is treated like a non-HTML response
            if (contentType == null || !contentType.contentType.equals("text/html"))
                return null; // don't continue if not HTML
            else {
                // determine the charset, or use the default
                Charset charset = getCharset(contentType);
                if (charset == null)
                    charset = Charset.defaultCharset();

                // read the response body, using BufferedReader for performance
                InputStream in = conn.getInputStream();
                BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset));
                int n = 0, totalRead = 0;
                char[] buf = new char[1024];
                StringBuilder content = new StringBuilder();

                // read until EOF or first 8192 characters
                while (totalRead < 8192 && (n = reader.read(buf, 0, buf.length)) != -1) {
                    content.append(buf, 0, n);
                    totalRead += n;
                }
                reader.close();

                // extract the title
                Matcher matcher = TITLE_TAG.matcher(content);
                if (matcher.find()) {
                    /* replace any occurrences of whitespace (which may
                     * include line feeds and other uglies) as well
                     * as HTML brackets with a space */
                    return matcher.group(1).replaceAll("[\\s\\<>]+", " ").trim();
                } else
                    return null;
            }
        }

        /**
         * Loops through response headers until Content-Type is found.
         * @param conn
         * @return ContentType object representing the value of
         * the Content-Type header (null if the header is absent)
         */
        private static ContentType getContentTypeHeader(URLConnection conn) {
            int i = 0;
            boolean moreHeaders = true;
            do {
                String headerName = conn.getHeaderFieldKey(i);
                String headerValue = conn.getHeaderField(i);
                // header names are case-insensitive
                if (headerName != null && headerName.equalsIgnoreCase("Content-Type"))
                    return new ContentType(headerValue);

                i++;
                moreHeaders = headerName != null || headerValue != null;
            } while (moreHeaders);

            return null;
        }

        private static Charset getCharset(ContentType contentType) {
            if (contentType != null && contentType.charsetName != null
                    && Charset.isSupported(contentType.charsetName))
                return Charset.forName(contentType.charsetName);
            else
                return null;
        }

        /**
         * Class holds the content type and charset (if present)
         */
        private static final class ContentType {
            private static final Pattern CHARSET_HEADER =
                Pattern.compile("charset=([-_a-zA-Z0-9]+)", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

            private String contentType;
            private String charsetName;

            private ContentType(String headerValue) {
                if (headerValue == null)
                    throw new IllegalArgumentException("ContentType must be constructed with a not-null headerValue");
                int n = headerValue.indexOf(";");
                if (n != -1) {
                    contentType = headerValue.substring(0, n);
                    Matcher matcher = CHARSET_HEADER.matcher(headerValue);
                    if (matcher.find())
                        charsetName = matcher.group(1);
                } else
                    contentType = headerValue;
            }
        }
    }

Using this class is simple:

    String title = TitleExtractor.getPageTitle("http://en.wikipedia.org/");
    System.out.println(title);

The link is here:

http://www.gotoquiz.com/web-coding/programming/java-programming/how-to-extract-titles-from-web-pages-in-java/

I hope it helps.

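There is no way to reliably work out the website name from the URL alone. However, if you just want to cut a specific part out of the URL string, you can do that with string manipulation, like this: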

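    // host is the host part of the URL, e.g. "www.google.co.in"
    if (host.endsWith(".co.in")) {
        String rest = host.substring(0, host.length() - ".co.in".length());
        // take the label just before ".co.in", dropping any subdomains
        website = rest.substring(rest.lastIndexOf('.') + 1);
    }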
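
The same idea can be generalised a little. The sketch below is only a heuristic under stated assumptions: SiteName, fromHost and TWO_LABEL_SUFFIXES are hypothetical names, and the suffix set is a tiny hand-picked sample, not a complete list. A full solution would need a maintained list of suffixes, such as the Public Suffix List.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class SiteName {
    // Hypothetical, deliberately tiny sample of two-label suffixes; not complete.
    private static final Set<String> TWO_LABEL_SUFFIXES =
            new HashSet<>(Arrays.asList("co.in", "gov.in", "co.uk"));

    // host is expected to be a bare host name such as "www.google.co.in".
    static String fromHost(String host) {
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        int n = labels.length;
        if (n < 2) {
            return labels.length == 0 ? "" : labels[0]; // single label, e.g. "localhost"
        }
        String lastTwo = labels[n - 2] + "." + labels[n - 1];
        // Skip two labels for a known two-label suffix, otherwise skip one.
        int suffixLabels = (n >= 3 && TWO_LABEL_SUFFIXES.contains(lastTwo)) ? 2 : 1;
        return labels[n - 1 - suffixLabels];
    }

    public static void main(String[] args) {
        String[] hosts = {"www.yahoo.com", "www.google.co.in", "in.com", "india.gov.in",
                          "in.yahoo.com", "philotheoristic.tumblr.com", "in.movies.yahoo.com"};
        for (String h : hosts) {
            System.out.println(h + " => " + fromHost(h));
        }
    }
}
```

With this sample set the seven hosts from the question come out as yahoo, google, in, india, yahoo, tumblr and yahoo. Any suffix missing from the set (for example com.au) would be handled wrongly, which is why the answer above says the name cannot be derived from the URL alone.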