Removing URLs from text using Java

How can I remove the URLs from a piece of text such as this example

String str="Fear psychosis after #AssamRiots - http://www.google.com/LdEbWTgD http://www.yahoo.com/mksVZKBz";

using a regular expression?

I want to remove every URL in the text, but my attempt does not work. My code is:

    String pattern = "(http(.*?)\\s)";
    Pattern pt = Pattern.compile(pattern);
    Matcher namemacher = pt.matcher(input);
    if (namemacher.find()) {
        str = input.replace(namemacher.group(0), "");
    }

Pass in the String that contains the URLs:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    private String removeUrl(String commentstr) {
        String urlPattern = "((https?|ftp|gopher|telnet|file|Unsure|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
        Pattern p = Pattern.compile(urlPattern, Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(commentstr);
        int i = 0;
        while (m.find()) {
            commentstr = commentstr.replaceAll(m.group(i), "").trim();
            i++;
        }
        return commentstr;
    }

Well, you haven't given any information about your text, so assuming it looks like this: "Some text here http://www.example.com some text there", you can do this:

    String yourText = "blah-blah";
    String cleartext = yourText.replaceAll("http.*?\\s", " ");

This removes every sequence that starts with "http" and runs up to the first whitespace character.
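One caveat: because that pattern requires a trailing whitespace character, a URL at the very end of the text (as in the question's example) is left in place. A minimal sketch of one way around this, assuming URLs never contain whitespace, is to match "http" followed by any run of non-whitespace characters. The class name `HttpStripper` and the method `stripHttp` are hypothetical names used only for this illustration.

```java
class HttpStripper {
    // Hypothetical helper: removes every run of non-whitespace characters
    // that starts with "http", including one at the end of the text.
    static String stripHttp(String text) {
        return text.replaceAll("http\\S*", "").trim();
    }

    public static void main(String[] args) {
        String str = "Fear psychosis after #AssamRiots - http://www.google.com/LdEbWTgD http://www.yahoo.com/mksVZKBz";
        System.out.println(stripHttp(str));
    }
}
```

Note that this also removes ordinary words that happen to begin with "http", and it leaves behind the spaces that surrounded each URL (only leading and trailing whitespace is trimmed).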

You should read the Javadoc for the String class. It will make this clear.

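How do you define a URL? You may want to filter not only http:// but also https:// and other protocols such as ftp://, rss://, or custom protocols.

Perhaps this regular expression will do the job: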

[\S]+://[\S]+

Explanation:

one or more non-whitespace characters
followed by the string "://"
followed by one or more non-whitespace characters
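A minimal sketch of how that expression might be applied in Java, using a precompiled `Pattern` and `Matcher.replaceAll`. The class name `SchemeUrlStripper` and the method `strip` are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

class SchemeUrlStripper {
    // One or more non-whitespace characters, then "://", then one or more
    // non-whitespace characters. Backslashes are doubled for the Java literal.
    private static final Pattern ANY_SCHEME_URL = Pattern.compile("\\S+://\\S+");

    // Hypothetical helper: deletes every match and trims the ends.
    static String strip(String text) {
        return ANY_SCHEME_URL.matcher(text).replaceAll("").trim();
    }

    public static void main(String[] args) {
        System.out.println(strip("see ftp://ideone.com/K3Cut and https://example.com/a?b=1 now"));
    }
}
```

Because the pattern accepts any non-whitespace characters on both sides of "://", it also swallows punctuation attached to the URL, such as an opening parenthesis directly before the scheme.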

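Note that the answers above will not work if your URL contains characters such as & and \, because replaceAll cannot handle those characters. What worked for me was to strip those characters into a new string variable, strip the same characters from the result of m.find(), and then call replaceAll on the new string variable.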

    private String removeUrl(String commentstr) {
        // get rid of ? and & in urls since replaceAll can't deal with them
        String commentstr1 = commentstr.replaceAll("\\?", "").replaceAll("\\&", "");
        String urlPattern = "((https?|ftp|gopher|telnet|file|Unsure|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
        Pattern p = Pattern.compile(urlPattern, Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(commentstr);
        int i = 0;
        while (m.find()) {
            commentstr = commentstr1.replaceAll(m.group(i).replaceAll("\\?", "").replaceAll("\\&", ""), "").trim();
            i++;
        }
        return commentstr;
    }

It is m.group(0) that should be replaced with the empty string, not m.group(i) with i incremented on every call to m.find(), as is done in one of the answers above.

    private String removeUrl(String commentstr) {
        String urlPattern = "((https?|ftp|gopher|telnet|file|Unsure|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
        Pattern p = Pattern.compile(urlPattern, Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(commentstr);
        StringBuffer sb = new StringBuffer(commentstr.length());
        while (m.find()) {
            // Copies the text before the match and substitutes nothing for the match itself.
            m.appendReplacement(sb, "");
        }
        // Copies the text after the last match; without this call the tail is lost.
        m.appendTail(sb);
        return sb.toString();
    }

If you are able to use Python instead, you may find this a simpler solution:

    import re

    text = " then ftp and mailto and gopher and file ftp://ideone.com/K3Cut rthen you "
    # Remove every token that starts with "ftp" followed by non-whitespace characters
    result = re.sub(r"ftp\S+", "", text)
    print(result)

As @Ev0oD mentioned, the code works perfectly, except for the following tweet that I was processing: RT @_Val83_: The cast of #ThorRagnarok playing "Ragnarok Paper Scissors" #TomHiddleston #MarkRuffalo (https://t.co /k9nYBu3QHu)

At the point where the token is removed: commentstr = commentstr.replaceAll(m.group(i),"").trim();

I ran into the following error:

java.util.regex.PatternSyntaxException: Unmatched closing ')' near index 22

where m.group(i) is https://t.co /k9nYBu3QHu)
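The exception arises because String.replaceAll compiles its first argument as a regular expression, so a matched URL containing `)` is not valid as a pattern. A minimal sketch of one way to avoid this is to remove the match with String.replace, which treats its arguments as literal text (wrapping the match in Pattern.quote before calling replaceAll would be another option). The class name `LiteralUrlRemover`, the method `removeUrls`, and the shortened pattern `https?://\S+` are assumptions made for this illustration, not the pattern from the answers above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralUrlRemover {
    // Simplified stand-in pattern (an assumption for this sketch).
    private static final Pattern URL =
            Pattern.compile("https?://\\S+", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: removes each matched URL as literal text.
    static String removeUrls(String text) {
        Matcher m = URL.matcher(text);
        String result = text;
        while (m.find()) {
            // group(0) is the whole match. String.replace does not interpret
            // it as a regex, so ')' or '?' inside the URL is harmless.
            result = result.replace(m.group(0), "");
        }
        return result.trim();
    }

    public static void main(String[] args) {
        System.out.println(removeUrls("#MarkRuffalo (https://example.com/k9nYBu3QHu)"));
    }
}
```

With this stand-in pattern the closing parenthesis is part of the match and is removed along with the URL, while the opening parenthesis stays in the text.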
