Downloading an entire web page

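There are ways to download an entire web page using HTMLEditorKit. However, I need to download a page that has to be scrolled before all of its content is loaded. This kind of lazy loading is usually implemented with JavaScript combined with Ajax.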

Q1: Is there a way to trick the target web page, using Java code, so that its full content can be downloaded?

Q2: If this cannot be done with Java alone, can it be done by combining Java with JavaScript?

Just a note, this is what I have written so far:

    import java.awt.image.BufferedImage;
    import java.io.BufferedReader;
    import java.io.File;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLConnection;
    import javax.imageio.ImageIO;
    import javax.swing.text.AttributeSet;
    import javax.swing.text.html.HTML;
    import javax.swing.text.html.HTMLDocument;
    import javax.swing.text.html.HTMLEditorKit;
    import javax.swing.text.html.parser.ParserDelegator;

    public class PageDownload {

        public static void main(String[] args) throws Exception {
            String webUrl = "...";
            URL url = new URL(webUrl);
            URLConnection connection = url.openConnection();
            InputStream is = connection.getInputStream();
            InputStreamReader isr = new InputStreamReader(is);
            BufferedReader br = new BufferedReader(isr);

            // Parse the static HTML into a Swing HTMLDocument.
            HTMLEditorKit htmlKit = new HTMLEditorKit();
            HTMLDocument htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
            HTMLEditorKit.Parser parser = new ParserDelegator();
            HTMLEditorKit.ParserCallback callback = htmlDoc.getReader(0);
            parser.parse(br, callback, true);

            // Walk every <img> tag and download the images with a known extension.
            for (HTMLDocument.Iterator iterator = htmlDoc.getIterator(HTML.Tag.IMG);
                    iterator.isValid(); iterator.next()) {
                AttributeSet attributes = iterator.getAttributes();
                String imgSrc = (String) attributes.getAttribute(HTML.Attribute.SRC);

                if (imgSrc != null
                        && (imgSrc.endsWith(".jpg") || imgSrc.endsWith(".jpeg")
                            || imgSrc.endsWith(".png") || imgSrc.endsWith(".ico")
                            || imgSrc.endsWith(".bmp"))) {
                    try {
                        downloadImage(webUrl, imgSrc);
                    } catch (IOException ex) {
                        System.out.println(ex.getMessage());
                    }
                }
            }
        }

        private static void downloadImage(String url, String imgSrc) throws IOException {
            BufferedImage image = null;
            try {
                // Naive join: a relative src is simply appended to the page URL.
                if (!imgSrc.startsWith("http")) {
                    url = url + imgSrc;
                } else {
                    url = imgSrc;
                }
                imgSrc = imgSrc.substring(imgSrc.lastIndexOf("/") + 1);
                String imageFormat = imgSrc.substring(imgSrc.lastIndexOf(".") + 1);
                String imgPath = "..." + imgSrc;

                URL imageUrl = new URL(url);
                image = ImageIO.read(imageUrl);
                if (image != null) {
                    File file = new File(imgPath);
                    ImageIO.write(image, imageFormat, file);
                }
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        }
    }
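A side note on the relative-URL branch in `downloadImage`: appending `imgSrc` to the page URL only works when the page URL ends with `/` and the `src` is a plain relative path. A minimal sketch of a more tolerant join, using only `java.net.URI` from the standard library, might look like this. The class and method names (`ImageUrlResolver`, `resolve`) and the example.com URLs are made up for illustration and are not part of the program above.

```java
import java.net.URI;

/** Resolves an <img> src value against the URL of the page it came from. */
final class ImageUrlResolver {

    // Hypothetical helper, not part of the original program.
    static String resolve(String pageUrl, String imgSrc) {
        // URI.resolve applies the standard relative-reference rules:
        // an absolute src is returned unchanged, a relative one is
        // merged with the page URL.
        return URI.create(pageUrl).resolve(imgSrc).toString();
    }

    public static void main(String[] args) {
        String page = "http://example.com/gallery/index.html";
        System.out.println(resolve(page, "img/cat.png"));
        System.out.println(resolve(page, "/logo.ico"));
        System.out.println(resolve(page, "https://cdn.example.com/a.jpg"));
    }
}
```

Note that `URI.create` throws `IllegalArgumentException` for strings that are not valid URIs (for example, a `src` containing unescaped spaces), so real pages may need extra handling.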

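Yes, you can use Java code to trick the web page into being downloaded to your local machine. You cannot download the static HTML content with JavaScript alone: JavaScript (in the browser) does not let you create files the way Java does.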

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class HttpDownloadUtility {

        private static final int BUFFER_SIZE = 4096;

        /**
         * Downloads a file from a URL
         * @param fileURL HTTP URL of the file to be downloaded
         * @param saveDir path of the directory to save the file
         * @throws IOException
         */
        public static void downloadFile(String fileURL, String saveDir) throws IOException {
            URL url = new URL(fileURL);
            HttpURLConnection httpConn = (HttpURLConnection) url.openConnection();
            int responseCode = httpConn.getResponseCode();

            // always check HTTP response code first
            if (responseCode == HttpURLConnection.HTTP_OK) {
                String fileName = "";
                String disposition = httpConn.getHeaderField("Content-Disposition");
                String contentType = httpConn.getContentType();
                int contentLength = httpConn.getContentLength();

                if (disposition != null) {
                    // extracts file name from header field
                    int index = disposition.indexOf("filename=");
                    if (index > 0) {
                        fileName = disposition.substring(index + 10, disposition.length() - 1);
                    }
                } else {
                    // extracts file name from URL
                    fileName = fileURL.substring(fileURL.lastIndexOf("/") + 1, fileURL.length());
                }

                System.out.println("Content-Type = " + contentType);
                System.out.println("Content-Disposition = " + disposition);
                System.out.println("Content-Length = " + contentLength);
                System.out.println("fileName = " + fileName);

                // opens input stream from the HTTP connection
                InputStream inputStream = httpConn.getInputStream();
                String saveFilePath = saveDir + File.separator + fileName;

                // opens an output stream to save into file
                FileOutputStream outputStream = new FileOutputStream(saveFilePath);

                int bytesRead = -1;
                byte[] buffer = new byte[BUFFER_SIZE];
                while ((bytesRead = inputStream.read(buffer)) != -1) {
                    outputStream.write(buffer, 0, bytesRead);
                }

                outputStream.close();
                inputStream.close();

                System.out.println("File downloaded");
            } else {
                System.out.println("No file to download. Server replied HTTP code: " + responseCode);
            }
            httpConn.disconnect();
        }
    }
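The header parsing above (`index + 10` and `length() - 1`) assumes the file name is wrapped in double quotes and is the last thing in the header. As a hedged sketch, the same step could be written with a regular expression that accepts both quoted and unquoted names and falls back to the URL when no name is found. The class and method names (`DispositionFileName`, `fileNameFor`) are invented for this example, and the extended `filename*=` form is deliberately ignored.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Picks a file name from a Content-Disposition header, or from the URL. */
final class DispositionFileName {

    // Matches filename="quoted name" or filename=bare-token (case-insensitive).
    private static final Pattern FILENAME = Pattern.compile(
            "filename\\s*=\\s*(?:\"([^\"]*)\"|([^;\\s]+))",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper, not part of HttpDownloadUtility.
    static String fileNameFor(String disposition, String fileURL) {
        if (disposition != null) {
            Matcher m = FILENAME.matcher(disposition);
            if (m.find()) {
                // group(1) is the quoted form, group(2) the unquoted one
                return m.group(1) != null ? m.group(1) : m.group(2);
            }
        }
        // No usable header: use the last path segment of the URL instead.
        return fileURL.substring(fileURL.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        String url = "http://example.com/files/a.zip";
        System.out.println(fileNameFor("attachment; filename=\"report.pdf\"", url));
        System.out.println(fileNameFor("attachment; filename=report.pdf", url));
        System.out.println(fileNameFor(null, url));
    }
}
```

Unlike the original, this version also falls back to the URL when a `Content-Disposition` header is present but carries no file name, instead of leaving the name empty.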

Use the HtmlUnit library to get all of the text and the image/CSS files.

HtmlUnit: [link] htmlunit.sourceforge.net

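1) To download the text content, use the code from the links below:

All text content: [link] How to get an HTML page using HtmlUnit

A specific tag, e.g. span: [link] How to get the text between specific spans using HtmlUnit

2) To get the images/files, see: [link] How can I tell HtmlUnit's WebClient to download images and CSS?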

You can achieve this with the Selenium WebDriver Java classes...

https://code.google.com/p/selenium/wiki/GettingStarted

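WebDriver is normally used for testing, but it can simulate a user scrolling down the page until the page stops changing, and then you can use Java code to save the content to a file.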
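The "scroll until the page stops changing" loop can be sketched independently of the browser. In the rough example below, the scroll step is passed in as a `LongSupplier` that scrolls once and reports the new page height, so the loop itself needs nothing beyond the JDK. The names `ScrollUntilStable` and `scrollToEnd` are invented for illustration, and the Selenium wiring shown in the comments assumes Selenium is on the classpath and that `driver` is an already-created `WebDriver`.

```java
import java.util.function.LongSupplier;

/** Keeps scrolling until the reported page height stops changing. */
final class ScrollUntilStable {

    /**
     * Calls scrollStep repeatedly until it returns the same height twice
     * in a row, or until maxRounds calls have been made.
     * Returns the last height seen.
     */
    static long scrollToEnd(LongSupplier scrollStep, int maxRounds) {
        long lastHeight = -1;
        for (int round = 0; round < maxRounds; round++) {
            long height = scrollStep.getAsLong();
            if (height == lastHeight) {
                break; // the last scroll loaded nothing new
            }
            lastHeight = height;
        }
        return lastHeight;
    }

    public static void main(String[] args) {
        // Stand-in for a real browser: a page that grows twice, then stops.
        long[] heights = {1000, 2000, 3000, 3000, 3000};
        int[] calls = {0};
        long finalHeight = scrollToEnd(
                () -> heights[Math.min(calls[0]++, heights.length - 1)], 20);
        System.out.println(finalHeight + " after " + calls[0] + " scroll steps");

        // With Selenium WebDriver (an assumption, not compiled here),
        // the step could be wired up roughly like this:
        //   JavascriptExecutor js = (JavascriptExecutor) driver;
        //   LongSupplier step = () -> {
        //       js.executeScript("window.scrollTo(0, document.body.scrollHeight);");
        //       // pause briefly here so pending Ajax requests can finish
        //       return (Long) js.executeScript("return document.body.scrollHeight;");
        //   };
        //   scrollToEnd(step, 50);
        //   String html = driver.getPageSource(); // then write html to a file
    }
}
```

The `maxRounds` cap is there because some pages scroll indefinitely; without it the loop might never end.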

You can do this with IDM's grabber.

This should help: https://www.internetdownloadmanager.com/support/idm-grabber/grabber_wizard.html