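What is the best way to extract the entire content from a BufferedReader object in Java?

I am trying to fetch an entire web page through a URLConnection.

What is the most efficient way to do this?

I am already doing this: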

    URL url = new URL("http://www.google.com/");
    URLConnection connection;
    connection = url.openConnection();
    InputStream in = connection.getInputStream();
    BufferedReader bf = new BufferedReader(new InputStreamReader(in));
    StringBuffer html = new StringBuffer();
    String line = bf.readLine();
    while (line != null) {
        html.append(line);
        line = bf.readLine();
    }
    bf.close();

html contains the entire HTML page.

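Your approach looks fine, but you can make it more efficient by avoiding the creation of an intermediate String object for every line.

The way to do this is to read directly into a temporary char[] buffer.

Here is a slightly modified version of your code (minus all error checking, exception handling, etc. for clarity):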

    URL url = new URL("http://www.google.com/");
    URLConnection connection = url.openConnection();
    InputStream in = connection.getInputStream();
    BufferedReader bf = new BufferedReader(new InputStreamReader(in));
    StringBuffer html = new StringBuffer();
    char[] charBuffer = new char[4096];
    int count;
    // read() returns the number of chars read, or -1 at end of stream
    while ((count = bf.read(charBuffer, 0, charBuffer.length)) != -1) {
        html.append(charBuffer, 0, count);
    }
    bf.close();

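For even better performance, if this code is going to be called frequently, you can of course do some extra things such as preallocating the character array and the StringBuffer.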
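A minimal sketch of that preallocation idea, under some assumptions: the class name `ReaderDrainer`, the method `drain`, and the 64 KB initial capacity are all illustrative and not part of the answer above. It also swaps in StringBuilder for StringBuffer, since a single-threaded helper does not need the synchronization.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper: the name and sizes are illustrative assumptions.
final class ReaderDrainer {
    // Allocated once and reused on every call, so an instance is NOT thread-safe.
    private final char[] charBuffer = new char[4096];
    private final StringBuilder html = new StringBuilder(64 * 1024);

    // Reads everything the reader provides and returns it as a String.
    String drain(Reader reader) throws IOException {
        html.setLength(0); // empty the builder but keep its allocated capacity
        int count;
        while ((count = reader.read(charBuffer, 0, charBuffer.length)) != -1) {
            html.append(charBuffer, 0, count);
        }
        return html.toString();
    }

    public static void main(String[] args) throws IOException {
        ReaderDrainer drainer = new ReaderDrainer();
        // In-memory readers stand in for the BufferedReader over the connection.
        System.out.println(drainer.drain(new StringReader("<html>first</html>")));
        System.out.println(drainer.drain(new StringReader("<html>second</html>")));
    }
}
```

Whether this is worth the loss of thread safety depends on how often the code really runs; for an occasional download the allocation cost is unlikely to matter.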

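I think this is the best approach. The size of the page is fixed ("it is what it is"), so you cannot improve on memory. Perhaps you could compress the contents once you have them, but they are not very useful in that form. I would guess that eventually you will want to parse the HTML into a DOM tree.

Anything you do to parallelize the reading would overly complicate the solution.

I would suggest using a StringBuilder with an initial capacity of 2048 or 4096.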
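As a rough illustration of that suggestion, the line-based loop from the question could look like this with a presized StringBuilder. The class name `LineCollector` is hypothetical, and appending `'\n'` after each line is an assumption: `readLine()` strips line terminators, so the original loop joins all lines together.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical helper: the name is an illustrative assumption.
final class LineCollector {
    static String readAll(BufferedReader reader) throws IOException {
        // 4096 is only the initial capacity; the builder grows when it is exceeded.
        StringBuilder html = new StringBuilder(4096);
        String line;
        while ((line = reader.readLine()) != null) {
            // readLine() drops the terminator, so put a newline back (assumption).
            html.append(line).append('\n');
        }
        return html.toString();
    }

    public static void main(String[] args) throws IOException {
        BufferedReader reader = new BufferedReader(new StringReader("<html>\n<body/>\n</html>"));
        System.out.print(readAll(reader));
    }
}
```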

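Why do you think the code you posted is not good enough? You sound like you are guilty of premature optimization.

Run with what you have and sleep at night.

What do you want to do with the HTML you obtain? Parse it? It may be good to know that any halfway decent HTML parser already has a constructor or method argument that takes a URL or an InputStream directly, so you do not have to worry about stream performance like this.

Assuming that all you want to do is what you described in your previous question, then with, for example, Jsoup you can get all those news links very easily, like this: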

    Document document = Jsoup.connect("http://news.google.com.ar/nwshp?hl=es&tab=wn").get();
    Elements newsLinks = document.select("h2.title a:eq(0)");
    for (Element newsLink : newsLinks) {
        System.out.println(newsLink.attr("href"));
    }

This yields the following after only a few seconds:

 http://www.infobae.com/mundo/541259-100970-0-Pinera-confirmo-que-el-rescate-comenzara-las-20-y-durara-24-y-48-horas
 http://www.lagaceta.com.ar/nota/403112/Argentina/Boudou-disculpo-con-DAIA-pero-volvio-cuestionar-medios.html
 http://www.abc.es/agencias/noticia.asp?noticia=550415
 http://www.google.com/hostednews/epa/article/ALeqM5i6x9rhP150KfqGJvwh56O-thi4VA?docId=1383133
 http://www.abc.es/agencias/noticia.asp?noticia=550292
 http://www.univision.com/contentroot/wirefeeds/noticias/8307387.shtml
 http://noticias.terra.com.ar/internacionales/ecuador-apoya-reclamo-argentino-por-ejercicios-en-malvinas,3361af2a712ab210VgnVCM4000009bf154d0RCRD.html
 http://www.infocielo.com/IC/Home/index.php?ver_nota=22642
 http://www.larazon.com.ar/economia/Cristina-Fernandez-Censo-indispensable-pais_0_176100098.html
 http://www.infobae.com/finanzas/541254-101275-0-Energeticas-llevaron-la-Bolsa-portena-ganancias
 http://www.telam.com.ar/vernota.php?tipo=N&idPub=200661&id=381154&dis=1&sec=1
 http://www.ambito.com/noticia.asp?id=547722
 http://www.canal-ar.com.ar/noticias/noticiamuestra.asp?Id=9469
 http://www.pagina12.com.ar/diario/cdigital/31-154760-2010-10-12.html
 http://www.lanacion.com.ar/nota.asp?nota_id=1314014
 http://www.rpp.com.pe/2010-10-12-ganador-del-pulitzer-destaca-nobel-de-mvll-noticia_302221.html
 http://www.lanueva.com/hoy/nota/b44a7553a7/1/79481.html
 http://www.larazon.com.ar/show/sdf_0_176100096.html
 http://www.losandes.com.ar/notas/2010/10/12/batista-siento-comodo-dieron-respaldo-520595.asp
 http://deportes.terra.com.ar/futbol/los-rumores-empiezan-a-complicar-la-vida-de-river-y-vuelve-a-sonar-gallego,a24483b8702ab210VgnVCM20000099f154d0RCRD.html
 http://www.clarin.com/deportes/futbol/Exigieron-Roman-regreso-Huracan_0_352164993.html
 http://www.el-litoral.com.ar/leer_noticia.asp?idnoticia=146622
 http://www.nuevodiarioweb.com.ar/nota/181453/Locales/C%C3%A1ncer_mama:_200_casos_a%C3%B1o_Santiago.html
 http://www.ultimahora.com/notas/367322-Funcionarios-sanitarios-capacitaran-sobre-cancer-de-mama
 http://www.lanueva.com/hoy/nota/65092f2044/1/79477.html
 http://www.infobae.com/policiales/541220-101275-0-Se-suspendio-la-declaracion-del-marido-Fernanda-Lemos
 http://www.clarin.com/sociedad/educacion/titulo_0_352164863.html

Did anyone already mention that regex is absolutely the wrong tool for parsing HTML? ;)

See also:

Pros and cons of HTML parsers in Java

You can try commons-io from Apache (http://commons.apache.org/io/api-release/org/apache/commons/io/IOUtils.html)

 new String(IOUtils.toCharArray(connection.getInputStream()))

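There are some technical considerations. You may want to use HttpURLConnection instead of URLConnection.

HttpURLConnection supports chunked transfer encoding, which lets you process the data in chunks instead of buffering all of the content before you start doing work. This can lead to an improved user experience.

HttpURLConnection also supports persistent connections. Why close the connection if you are going to request another resource right away? Keeping the TCP connection to the web server open lets your application download multiple resources quickly, without paying the overhead (latency) of establishing a new TCP connection for each one.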
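A hedged sketch of what processing the response piece by piece might look like. The class `ChunkedFetch` and its methods are hypothetical names, and the body is assumed to be UTF-8; a real client would take the charset from the Content-Type header. Note that the blocks handed to the consumer are simply whatever each `read` call returns; they do not necessarily line up with the HTTP chunks sent by the server.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

// Hypothetical helper: class and method names are illustrative assumptions.
final class ChunkedFetch {
    // Hands each block of characters to the consumer as soon as it has been read.
    static void streamChunks(Reader reader, Consumer<String> onChunk) throws IOException {
        char[] buffer = new char[4096];
        int count;
        while ((count = reader.read(buffer, 0, buffer.length)) != -1) {
            onChunk.accept(new String(buffer, 0, count));
        }
    }

    // Opens an http(s) URL and streams the body to the consumer (UTF-8 assumed).
    static void fetch(String address, Consumer<String> onChunk) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(address).openConnection();
        try (InputStream in = connection.getInputStream();
             Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            streamChunks(reader, onChunk);
        }
        // disconnect() is deliberately not called: reading the body to the end and
        // closing the stream leaves the connection eligible for reuse by the JDK.
    }

    public static void main(String[] args) throws IOException {
        // Demonstrates the streaming part with in-memory data instead of the network.
        streamChunks(new StringReader("<html>demo</html>"), System.out::println);
    }
}
```

A caller would use it along the lines of `ChunkedFetch.fetch("http://www.google.com/", chunk -> process(chunk))`, where `process` is whatever incremental work the application does.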

Tell the server that you support gzip, and wrap a BufferedReader around a GZIPInputStream if the response header says that the content is compressed.
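One way this could be sketched, assuming a UTF-8 body and a hypothetical helper class `GzipAwareReader`. With a real connection you would call `connection.setRequestProperty("Accept-Encoding", "gzip")` before connecting and pass `connection.getContentEncoding()` as the second argument; the example below simulates a compressed body in memory so it runs without network access.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: the name is an illustrative assumption.
final class GzipAwareReader {
    // Decompresses only when the Content-Encoding value is "gzip" (UTF-8 assumed).
    static BufferedReader open(InputStream body, String contentEncoding) throws IOException {
        InputStream in = "gzip".equalsIgnoreCase(contentEncoding)
                ? new GZIPInputStream(body)
                : body;
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Build a gzip-compressed body in memory to stand in for a server response.
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(compressed)) {
            gzip.write("<html>hello</html>".getBytes(StandardCharsets.UTF_8));
        }
        InputStream body = new ByteArrayInputStream(compressed.toByteArray());
        try (BufferedReader reader = open(body, "gzip")) {
            System.out.println(reader.readLine());
        }
    }
}
```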
