Filter (search and replace) an array of bytes in an InputStream

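I have an InputStream that takes an HTML file as its input. I have to get the bytes from the input stream.

I have a string: "XYZ". I'd like to convert this string to bytes and check whether the byte sequence I get from the InputStream contains a match. If it does, I have to replace the match with the byte sequence of some other string.

Could anyone help me with this? I have used regular expressions to find and replace, but I don't know how to find and replace in a byte stream.

Previously I used jsoup to parse the HTML and replace the string, but because of some UTF encoding problems the file appeared to be corrupted afterwards.

TL;DR: My question is:

Is there a way to find and replace a string, in byte form, in a raw InputStream in Java?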

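I'm not sure you have chosen the best approach to solve your problem.

That said, I don't like to (and have a policy not to) answer questions with "don't", so here goes...

Have a look at FilterInputStream.

From the documentation:

A FilterInputStream contains some other input stream, which it uses as its basic source of data, possibly transforming the data along the way or providing additional functionality.

It was a fun exercise to write it up. Here's a complete example: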

    import java.io.*;
    import java.util.*;

    class ReplacingInputStream extends FilterInputStream {

        // Bytes read ahead from the source (0..255, or -1 for end of stream).
        LinkedList<Integer> inQueue = new LinkedList<Integer>();
        // Bytes that are ready to be handed to the caller.
        LinkedList<Integer> outQueue = new LinkedList<Integer>();
        final byte[] search, replacement;

        protected ReplacingInputStream(InputStream in, byte[] search, byte[] replacement) {
            super(in);
            this.search = search;
            this.replacement = replacement;
        }

        private boolean isMatchFound() {
            Iterator<Integer> inIter = inQueue.iterator();
            for (int i = 0; i < search.length; i++)
                // Mask to 0..255 so that bytes >= 0x80 compare equal to read() values.
                if (!inIter.hasNext() || (search[i] & 0xFF) != inIter.next())
                    return false;
            return true;
        }

        private void readAhead() throws IOException {
            // Work up some look-ahead.
            while (inQueue.size() < search.length) {
                int next = super.read();
                inQueue.offer(next);
                if (next == -1)
                    break;
            }
        }

        @Override
        public int read() throws IOException {
            // Next byte not yet determined. Loop so that an empty
            // replacement does not leave outQueue empty.
            while (outQueue.isEmpty()) {
                readAhead();
                if (isMatchFound()) {
                    for (int i = 0; i < search.length; i++)
                        inQueue.remove();
                    // Mask so that read() never returns a negative value for data.
                    for (byte b : replacement)
                        outQueue.offer(b & 0xFF);
                } else
                    outQueue.add(inQueue.remove());
            }
            return outQueue.remove();
        }

        // TODO: Override the other read methods.
    }

Example usage

    // Lives in the same file as ReplacingInputStream above and uses its imports.
    class Test {
        public static void main(String[] args) throws Exception {
            byte[] bytes = "hello xyz world.".getBytes("UTF-8");
            ByteArrayInputStream bis = new ByteArrayInputStream(bytes);

            byte[] search = "xyz".getBytes("UTF-8");
            byte[] replacement = "abc".getBytes("UTF-8");

            InputStream ris = new ReplacingInputStream(bis, search, replacement);
            ByteArrayOutputStream bos = new ByteArrayOutputStream();

            int b;
            while (-1 != (b = ris.read()))
                bos.write(b);

            // Decode with the same charset that was used to encode.
            System.out.println(new String(bos.toByteArray(), "UTF-8"));
        }
    }

Given the bytes for the string "hello xyz world." it prints:

    hello abc world.
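The TODO at the end of the class above matters more than it may look. FilterInputStream's bulk read(byte[], int, int) forwards directly to the wrapped stream, so a caller that reads through a buffer would bypass an overridden read() altogether. The following is a minimal sketch of one way to close that gap by routing the bulk read through the single-byte read(). UpperCasingInputStream and BulkReadDemo are hypothetical names chosen to keep the example small; the same override should slot into ReplacingInputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for any filter that transforms data in read().
class UpperCasingInputStream extends FilterInputStream {
    UpperCasingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        // Only ASCII lower-case letters change; -1 (end of stream) passes through.
        return (b >= 'a' && b <= 'z') ? b - 32 : b;
    }

    // Route bulk reads through read() so the transformation is not bypassed.
    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (off < 0 || len < 0 || len > buf.length - off) {
            throw new IndexOutOfBoundsException();
        }
        if (len == 0) {
            return 0;
        }
        int count = 0;
        while (count < len) {
            int b = read();
            if (b == -1) {
                break;
            }
            buf[off + count++] = (byte) b;
        }
        return count == 0 ? -1 : count;
    }
}

class BulkReadDemo {
    // Drains the filter through an 8-byte buffer, i.e. through the bulk read path.
    static String upperCase(String text) {
        InputStream in = new UpperCasingInputStream(
                new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8];
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

One trade-off of this sketch: it keeps reading until the buffer is full or the stream ends, so on a slow source such as a socket it may block longer than the wrapped stream's own bulk read would.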

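The following approach will work, but I don't know how big the impact on performance is.

Wrap the InputStream with an InputStreamReader,
then wrap the InputStreamReader with a FilterReader that replaces the strings,
then wrap the FilterReader with a ReaderInputStream.

It is crucial to choose the appropriate encoding, otherwise the contents of the stream will become corrupted.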
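The first two steps of that chain can be sketched with the JDK alone. ReplacingReader and ReaderChainDemo below are hypothetical names, not classes from any library, and the matching is a simple look-ahead window rather than anything optimized. The last step, turning the Reader back into an InputStream, would use the ReaderInputStream mentioned above (Apache Commons IO provides a class of that name); it is left out here so the example has no dependencies.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical FilterReader that replaces every occurrence of `search`.
class ReplacingReader extends FilterReader {
    private final String search;
    private final String replacement;
    private final StringBuilder lookAhead = new StringBuilder(); // decoded, not yet emitted
    private int pending = 0;     // replacement chars still to emit
    private boolean eof = false;

    ReplacingReader(Reader in, String search, String replacement) {
        super(in);
        if (search.isEmpty()) {
            throw new IllegalArgumentException("search must not be empty");
        }
        this.search = search;
        this.replacement = replacement;
    }

    @Override
    public int read() throws IOException {
        while (true) {
            if (pending > 0) {
                int index = replacement.length() - pending;
                pending--;
                return replacement.charAt(index);
            }
            // Fill the window up to the length of the search string.
            while (!eof && lookAhead.length() < search.length()) {
                int c = in.read();
                if (c == -1) {
                    eof = true;
                } else {
                    lookAhead.append((char) c);
                }
            }
            if (lookAhead.length() == 0) {
                return -1;
            }
            if (search.contentEquals(lookAhead)) {
                lookAhead.setLength(0);          // drop the match
                pending = replacement.length();  // and emit the replacement next
                continue;
            }
            char c = lookAhead.charAt(0);
            lookAhead.deleteCharAt(0);
            return c;
        }
    }

    // FilterReader forwards bulk reads to the wrapped reader, so route them through read().
    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int count = 0;
        while (count < len) {
            int c = read();
            if (c == -1) {
                break;
            }
            cbuf[off + count++] = (char) c;
        }
        return (count == 0 && len > 0) ? -1 : count;
    }
}

class ReaderChainDemo {
    // Decodes `input` with `charset`, replaces text, and returns the result as a String.
    static String replace(byte[] input, Charset charset, String search, String replacement) {
        try (Reader reader = new ReplacingReader(
                new InputStreamReader(new ByteArrayInputStream(input), charset),
                search, replacement)) {
            StringBuilder out = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                out.append((char) c);
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The charset passed to InputStreamReader has to be the one the file was actually written in, and the same charset should be used when encoding the result again; that is the encoding caveat from the answer in concrete form.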

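If you want to use regular expressions to replace strings, you can use Streamflyer, a tool of mine, which is a convenient alternative to FilterReader. You will find an example for byte streams on the Streamflyer web page. Hope this helps.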

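I needed something like this as well and decided to roll my own solution instead of using the example above by @aioobe. Have a look at the code. You can pull the library from Maven Central, or just copy the source code.

This is how you use it. In this case I'm using a nested instance to replace two patterns, to fix DOS and Mac line endings.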

    // DOS line endings are "\r\n"; the inner stream handles those, the outer one the remaining "\r".
    new ReplacingInputStream(new ReplacingInputStream(is, "\r\n", "\n"), "\r", "\n");

Here's the full source code:

    import java.io.FilterInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.charset.StandardCharsets;

    /**
     * Simple FilterInputStream that can replace occurrences of bytes with something else.
     */
    public class ReplacingInputStream extends FilterInputStream {

        // while matching, this is where the bytes go.
        int[] buf = null;
        int matchedIndex = 0;
        int unbufferIndex = 0;
        int replacedIndex = 0;

        private final byte[] pattern;
        private final byte[] replacement;
        private State state = State.NOT_MATCHED;

        // simple state machine for keeping track of what we are doing
        private enum State {
            NOT_MATCHED, MATCHING, REPLACING, UNBUFFER
        }

        /**
         * @param is input
         * @return nested replacing stream that replaces \r\n (DOS) and \r (old Mac) line endings with UNIX ones "\n".
         */
        public static InputStream newLineNormalizingInputStream(InputStream is) {
            return new ReplacingInputStream(new ReplacingInputStream(is, "\r\n", "\n"), "\r", "\n");
        }

        /**
         * Replace occurrences of pattern in the input. Note: input is assumed to be UTF-8 encoded.
         * If that is not the case, use the byte[] based pattern and replacement.
         * @param in input
         * @param pattern pattern to replace.
         * @param replacement the replacement or null
         */
        public ReplacingInputStream(InputStream in, String pattern, String replacement) {
            this(in, pattern.getBytes(StandardCharsets.UTF_8),
                    replacement == null ? null : replacement.getBytes(StandardCharsets.UTF_8));
        }

        /**
         * Replace occurrences of pattern in the input.
         * @param in input
         * @param pattern pattern to replace
         * @param replacement the replacement or null
         */
        public ReplacingInputStream(InputStream in, byte[] pattern, byte[] replacement) {
            super(in);
            // plain JDK checks, so that the class needs no validation library
            if (pattern == null) {
                throw new NullPointerException("pattern");
            }
            if (pattern.length == 0) {
                throw new IllegalArgumentException("pattern length should be > 0");
            }
            this.pattern = pattern;
            this.replacement = replacement;
            // we will never match more than the pattern length
            buf = new int[pattern.length];
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            // copy of parent logic; we need to call our own read() instead of super.read(),
            // which delegates instead of calling our read
            if (b == null) {
                throw new NullPointerException();
            } else if (off < 0 || len < 0 || len > b.length - off) {
                throw new IndexOutOfBoundsException();
            } else if (len == 0) {
                return 0;
            }

            int c = read();
            if (c == -1) {
                return -1;
            }
            b[off] = (byte) c;

            int i = 1;
            try {
                for (; i < len; i++) {
                    c = read();
                    if (c == -1) {
                        break;
                    }
                    b[off + i] = (byte) c;
                }
            } catch (IOException ee) {
                // as in InputStream.read: an error after the first byte just ends this read early
            }
            return i;
        }

        @Override
        public int read(byte[] b) throws IOException {
            // call our own read
            return read(b, 0, b.length);
        }

        @Override
        public int read() throws IOException {
            // use a simple state machine to figure out what we are doing
            int next;
            switch (state) {
            case NOT_MATCHED:
                // we are not currently matching, replacing, or unbuffering
                next = super.read();
                // mask to 0..255 so that bytes >= 0x80 compare equal to read() values
                if ((pattern[0] & 0xFF) == next) {
                    // clear whatever was there and make sure we start at 0
                    buf = new int[pattern.length];
                    matchedIndex = 0;
                    buf[matchedIndex++] = next;
                    if (pattern.length == 1) {
                        // edge case: a pattern of length 1 is already fully matched
                        if (replacement == null || replacement.length == 0) {
                            // nothing to emit, stay in NOT_MATCHED
                            matchedIndex = 0;
                        } else {
                            state = State.REPLACING;
                            replacedIndex = 0;
                        }
                    } else {
                        // pattern is longer than 1 byte, keep matching
                        state = State.MATCHING;
                    }
                    // recurse to continue matching
                    return read();
                } else {
                    return next;
                }
            case MATCHING:
                // the previous bytes matched part of the pattern
                next = super.read();
                if ((pattern[matchedIndex] & 0xFF) == next) {
                    buf[matchedIndex++] = next;
                    if (matchedIndex == pattern.length) {
                        // we've found a full match!
                        if (replacement == null || replacement.length == 0) {
                            // the replacement is empty, go straight to NOT_MATCHED
                            state = State.NOT_MATCHED;
                            matchedIndex = 0;
                        } else {
                            // start replacing
                            state = State.REPLACING;
                            replacedIndex = 0;
                        }
                    }
                } else {
                    // mismatch -> unbuffer. Note: the buffered bytes are not re-examined,
                    // so a match starting inside them (pattern "ab" in "aab") is missed.
                    buf[matchedIndex++] = next;
                    state = State.UNBUFFER;
                    unbufferIndex = 0;
                }
                return read();
            case REPLACING:
                // we've fully matched the pattern and are returning bytes from the replacement
                next = replacement[replacedIndex++] & 0xFF;
                if (replacedIndex == replacement.length) {
                    state = State.NOT_MATCHED;
                    replacedIndex = 0;
                }
                return next;
            case UNBUFFER:
                // we partially matched the pattern before encountering a non matching byte
                // we need to serve up the buffered bytes before we go back to NOT_MATCHED
                next = buf[unbufferIndex++];
                if (unbufferIndex == matchedIndex) {
                    state = State.NOT_MATCHED;
                    matchedIndex = 0;
                }
                return next;
            default:
                throw new IllegalStateException("no such state " + state);
            }
        }

        @Override
        public String toString() {
            return state.name() + " " + matchedIndex + " " + replacedIndex + " " + unbufferIndex;
        }
    }

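There isn't any built-in functionality for search-and-replace on byte streams (InputStream).

And a method for completing this task efficiently and correctly is not immediately obvious. I have implemented the Boyer-Moore algorithm for streams, and it works well, but it took some time. Without an algorithm like this, you have to resort to a brute-force approach where you look for the pattern starting at every position in the stream, which can be slow.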
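To give a feel for what such an algorithm buys you, here is a minimal sketch of the Horspool simplification of Boyer-Moore. It searches an in-memory byte array rather than a stream, so it is not the streaming implementation described above, and HorspoolSearch is a hypothetical name. The idea is that the byte under the end of the search window tells you how far the window can safely jump, so most positions are never compared at all.

```java
import java.util.Arrays;

class HorspoolSearch {
    // Returns the index of the first occurrence of pattern in data at or after `from`, or -1.
    static int indexOf(byte[] data, byte[] pattern, int from) {
        int m = pattern.length;
        if (m == 0) {
            throw new IllegalArgumentException("pattern must not be empty");
        }
        // shift[b]: how far the window may move when its last byte is b.
        int[] shift = new int[256];
        Arrays.fill(shift, m);
        for (int i = 0; i < m - 1; i++) {
            // Mask because Java bytes are signed and would otherwise index negatively.
            shift[pattern[i] & 0xFF] = m - 1 - i;
        }
        int pos = Math.max(from, 0);
        while (pos <= data.length - m) {
            // Compare the window against the pattern from right to left.
            int j = m - 1;
            while (j >= 0 && data[pos + j] == pattern[j]) {
                j--;
            }
            if (j < 0) {
                return pos;
            }
            pos += shift[data[pos + m - 1] & 0xFF];
        }
        return -1;
    }
}
```

Adapting this to a stream means keeping a sliding buffer of at least the pattern length and handling matches that straddle buffer refills, which is presumably where the extra implementation time mentioned above goes.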

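Even if you decode the HTML as text, using a regular expression to match patterns might be a bad idea, since HTML is not a "regular" language.

So, even though you've run into some difficulties, I suggest you pursue your original approach of parsing the HTML as a document. While you are having trouble with the character encoding, it will probably be easier in the long run to fix the right solution than to jury-rig the wrong one.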
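We don't know what actually corrupted the file in the question, but a frequent cause of this kind of damage is decoding or encoding with a charset other than the one the file was written in. The sketch below illustrates that with plain JDK string handling, assuming the file is UTF-8; CharsetRoundTrip is a hypothetical name, and a real fix would pass the same charset explicitly to whatever parser reads the document and to whatever code writes it back.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetRoundTrip {
    // Decodes the bytes, replaces text, and encodes again, all with the same charset.
    static byte[] replaceText(byte[] html, Charset charset, String search, String replacement) {
        String text = new String(html, charset);
        return text.replace(search, replacement).getBytes(charset);
    }

    public static void main(String[] args) {
        // "caf\u00e9" contains a non-ASCII character that takes two bytes in UTF-8.
        byte[] html = "<p>caf\u00e9 XYZ</p>".getBytes(StandardCharsets.UTF_8);

        // Matching charset: the non-ASCII character survives the round trip.
        byte[] good = replaceText(html, StandardCharsets.UTF_8, "XYZ", "abc");
        // Wrong charset: the two UTF-8 bytes cannot be decoded and are lost for good.
        byte[] bad = replaceText(html, StandardCharsets.US_ASCII, "XYZ", "abc");

        System.out.println(new String(good, StandardCharsets.UTF_8));
        System.out.println(new String(bad, StandardCharsets.UTF_8));
    }
}
```

This loads the whole document into memory, which is fine for typical HTML files but is a different trade-off from the streaming filters shown in the other answers.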

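I needed a solution to this as well, but found that the answers here incurred too much memory and/or CPU overhead. Based on simple benchmarking, the solution below significantly outperforms the others here in those terms.

This solution is especially memory-efficient, incurring no measurable cost even with streams larger than 1 GB.

That said, this is not a zero-CPU-cost solution. The CPU/processing-time overhead is probably reasonable for all but the most demanding or resource-sensitive scenarios, but the overhead is real and should be considered when evaluating whether this solution is worth using in a given context.

In my case, the largest real-world file we process is about 6 MB, where we see added latency of about 170 ms with 44 URL replacements. This is for a Zuul-based reverse proxy running on AWS ECS with a single CPU share (1024). For most files (under 100 KB), the added latency is sub-millisecond. Under high concurrency (and thus CPU contention) the added latency could increase, but we are currently able to process hundreds of files concurrently on a single node with no humanly noticeable latency impact.

The solution we are using: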

    import java.io.IOException;
    import java.io.InputStream;

    public class TokenReplacingStream extends InputStream {

        private final InputStream source;
        private final byte[] oldBytes;
        private final byte[] newBytes;
        private int tokenMatchIndex = 0; // number of oldBytes matched so far
        private int bytesIndex = 0;      // position while emitting oldBytes or newBytes
        private boolean unwinding;       // true while re-emitting a failed partial match
        private int mismatch;            // the byte that ended the partial match
        private int numberOfTokensReplaced = 0;

        public TokenReplacingStream(InputStream source, byte[] oldBytes, byte[] newBytes) {
            assert oldBytes.length > 0;
            this.source = source;
            this.oldBytes = oldBytes;
            this.newBytes = newBytes;
        }

        @Override
        public int read() throws IOException {

            if (unwinding) {
                if (bytesIndex < tokenMatchIndex) {
                    // mask so that bytes >= 0x80 are returned as 128..255, not as negatives
                    return oldBytes[bytesIndex++] & 0xFF;
                } else {
                    bytesIndex = 0;
                    tokenMatchIndex = 0;
                    unwinding = false;
                    return mismatch;
                }
            } else if (tokenMatchIndex == oldBytes.length) {
                // full match: emit the replacement, then resume normal reading
                if (bytesIndex == newBytes.length) {
                    bytesIndex = 0;
                    tokenMatchIndex = 0;
                    numberOfTokensReplaced++;
                } else {
                    return newBytes[bytesIndex++] & 0xFF;
                }
            }

            int b = source.read();
            if (b == (oldBytes[tokenMatchIndex] & 0xFF)) {
                tokenMatchIndex++;
            } else if (tokenMatchIndex > 0) {
                // partial match failed: replay the matched prefix, then the mismatching byte
                mismatch = b;
                unwinding = true;
            } else {
                return b;
            }

            return read();
        }

        @Override
        public void close() throws IOException {
            source.close();
        }

        public int getNumberOfTokensReplaced() {
            return numberOfTokensReplaced;
        }
    }
