How to write a valid decoder to read from a .pb, given its .proto

Based on the answer to this question, I think I have the "wrong decoder" for my .pb file.

This is the data I am trying to decode.

This is my .proto file.

Based on the ListPeople.java example provided in the Java tutorial documentation, I tried to write something similar to start picking the data apart. This is what I wrote:

    import cc.refectorie.proj.relation.protobuf.DocumentProtos.Document;
    import cc.refectorie.proj.relation.protobuf.DocumentProtos.Document.Sentence;

    import java.io.FileInputStream;

    public class ListDocument {
        // Iterates through all sentences in the document and prints their tokens.
        static void Print(Document document) {
            for (Sentence sentence : document.getSentencesList()) {
                for (int i = 0; i < sentence.getTokensCount(); i++) {
                    System.out.println(" getTokens(" + i + "): " + sentence.getTokens(i));
                }
            }
        }

        // Main function: reads an entire document from a file and prints all
        // the information inside.
        public static void main(String[] args) throws Exception {
            if (args.length != 1) {
                System.err.println("Usage: ListDocument FILE");
                System.exit(-1);
            }
            // Read the existing document.
            Document document = Document.parseFrom(new FileInputStream(args[0]));
            Print(document);
        }
    }

But when I run it, I get this error:

    Exception in thread "main" com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.
        at com.google.protobuf.InvalidProtocolBufferException.invalidEndTag(InvalidProtocolBufferException.java:94)
        at com.google.protobuf.CodedInputStream.checkLastTagWas(CodedInputStream.java:174)
        at com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:194)
        at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:210)
        at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:215)
        at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
        at cc.refectorie.proj.relation.protobuf.DocumentProtos$Document.parseFrom(DocumentProtos.java:4770)
        at ListDocument.main(ListDocument.java:40)

So, as I said above, I think this has to do with me not defining the decoder correctly. Is there a way to look at the .proto file I am trying to use and figure out a way to read all the data?

Is there a way to look at that .proto file and see what I am doing wrong?

These are the first few lines of the file I want to read:

 Ü &/guid/9202a8c04000641f8000000003221072&/guid/9202a8c04000641f80000000004cfd50NA"Ö S/m/vinci8/data1/riedel/projects/relation/kb/nyt1/docstore/2007-joint/1850511.xml.pb„€€€øÿÿÿÿƒ€€€øÿÿÿÿ"PERSON->PERSON"'inverse_false|PERSON|on bass and|PERSON"/inverse_false|with|PERSON|on bass and|PERSON|on"7inverse_false|, with|PERSON|on bass and|PERSON|on drums"$inverse_false|PERSON|IN NN CC|PERSON",inverse_false|with|PERSON|IN NN CC|PERSON|on"4inverse_false|, with|PERSON|IN NN CC|PERSON|on drums"`str:Dave[NMOD]->|PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]Barry"]str:Dave[NMOD]->|PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]on"Rstr:Dave[NMOD]->|PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]|PERSON|[PMOD]->[ADV]->[ROOT]<-[PRD]<-[PMOD]|PERSON|->-><-<-with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]Barry"Adep:PERSON|[PMOD]->[ADV]->[ROOT]<-[PRD]<-[PMOD]"dir:PERSON|->-><-<-"Pstr:PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]on"Adep:PERSON|[PMOD]->[ADV]->[ROOT]<-[PRD]<-[PMOD]"dir:PERSON|->-><-<-"Estr:PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]PERSON"'inverse_false|PERSON|on bass and|PERSON"/inverse_false|with|PERSON|on bass and|PERSON|on"7inverse_false|, with|PERSON|on bass and|PERSON|on drums"$inverse_false|PERSON|IN NN CC|PERSON",inverse_false|with|PERSON|IN NN CC|PERSON|on"4inverse_false|, with|PERSON|IN NN CC|PERSON|on drums"cstr:Dave[NMOD]->|PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]Barry"`str:Dave[NMOD]->|PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]on"Ustr:Dave[NMOD]->|PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]|PERSON|[PMOD]->[NMOD]->[NULL]<-[NMOD]<-[PMOD]|PERSON|->-><-<-with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]Barry"Cdep:PERSON|[PMOD]->[NMOD]->[NULL]<-[NMOD]<-[PMOD]"dir:PERSON|->-><-<-"Sstr:PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]on"Cdep:PERSON|[PMOD]->[NMOD]->[NULL]<-[NMOD]<-[PMOD]"dir:PERSON|->-><-<-"Hstr:PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON*ÊTonight he brings his energies and expertise to the Miller Theater for the festival 's thrilling finale : a reunion of the 1970s Sam Rivers Trio , with Dave Holland on bass and Barry Altschul on drums .â &/guid/9202a8c04000641f80000000004cfd50&/guid/9202a8c04000641f8000000003221072NA"Ù 

EDIT


This is a file from another researcher that is used to parse these files, or so I am told. Can I use it?

    package edu.stanford.nlp.kbp.slotfilling.multir;

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;
    import java.util.zip.GZIPInputStream;

    import edu.stanford.nlp.kbp.slotfilling.classify.MultiLabelDataset;
    import edu.stanford.nlp.kbp.slotfilling.common.Log;
    import edu.stanford.nlp.kbp.slotfilling.multir.DocumentProtos.Relation;
    import edu.stanford.nlp.stats.ClassicCounter;
    import edu.stanford.nlp.stats.Counter;
    import edu.stanford.nlp.util.ErasureUtils;
    import edu.stanford.nlp.util.HashIndex;
    import edu.stanford.nlp.util.Index;

    /**
     * Converts Hoffmann's data in protobuf format to our MultiLabelDataset
     * @author Mihai
     */
    public class ProtobufToMultiLabelDataset {
      static class RelationAndMentions {
        String arg1;
        String arg2;
        Set<String> posLabels;
        Set<String> negLabels;
        List<Mention> mentions;

        public RelationAndMentions(String types, String a1, String a2) {
          arg1 = a1;
          arg2 = a2;
          String [] rels = types.split(",");
          posLabels = new HashSet<String>();
          for(String r: rels){
            if(! r.equals("NA")) posLabels.add(r.trim());
          }
          negLabels = new HashSet<String>(); // will be populated later
          mentions = new ArrayList<Mention>();
        }
      };

      static class Mention {
        List<String> features;

        public Mention(List<String> feats) {
          features = feats;
        }
      }

      public static void main(String[] args) throws Exception {
        String input = args[0];
        InputStream is = new GZIPInputStream(
            new BufferedInputStream(new FileInputStream(input)));
        toMultiLabelDataset(is);
        is.close();
      }

      public static MultiLabelDataset<String, String> toMultiLabelDataset(InputStream is) throws IOException {
        List<RelationAndMentions> relations = toRelations(is, true);
        MultiLabelDataset<String, String> dataset = toDataset(relations);
        return dataset;
      }

      public static void toDatums(InputStream is,
          List<List<Collection<String>>> relationFeatures,
          List<Set<String>> labels) throws IOException {
        List<RelationAndMentions> relations = toRelations(is, false);
        toDatums(relations, relationFeatures, labels);
      }

      private static void toDatums(List<RelationAndMentions> relations,
          List<List<Collection<String>>> relationFeatures,
          List<Set<String>> labels) {
        for(RelationAndMentions rel: relations) {
          labels.add(rel.posLabels);
          List<Collection<String>> mentionFeatures = new ArrayList<Collection<String>>();
          for(int i = 0; i < rel.mentions.size(); i ++){
            mentionFeatures.add(rel.mentions.get(i).features);
          }
          relationFeatures.add(mentionFeatures);
        }
        assert(labels.size() == relationFeatures.size());
      }

      public static List<RelationAndMentions> toRelations(InputStream is, boolean generateNegativeLabels) throws IOException {
        //
        // Parse the protobuf
        //
        // all relations are stored here
        List<RelationAndMentions> relations = new ArrayList<RelationAndMentions>();
        // all known relations (without NIL)
        Set<String> relTypes = new HashSet<String>();
        Map<String, Map<String, Set<String>>> knownRelationsPerEntity =
            new HashMap<String, Map<String, Set<String>>>();
        Counter<Integer> labelCountHisto = new ClassicCounter<Integer>();
        Relation r = null;
        while ((r = Relation.parseDelimitedFrom(is)) != null) {
          RelationAndMentions relation = new RelationAndMentions(
              r.getRelType(), r.getSourceGuid(), r.getDestGuid());
          labelCountHisto.incrementCount(relation.posLabels.size());
          relTypes.addAll(relation.posLabels);
          relations.add(relation);

          for(int i = 0; i < r.getMentionCount(); i ++) {
            DocumentProtos.Relation.RelationMentionRef mention = r.getMention(i);
            // String s = mention.getSentence();
            relation.mentions.add(new Mention(mention.getFeatureList()));
          }

          for(String l: relation.posLabels) {
            addKnownRelation(relation.arg1, relation.arg2, l, knownRelationsPerEntity);
          }
        }
        Log.severe("Loaded " + relations.size() + " relations.");
        Log.severe("Found " + relTypes.size() + " relation types: " + relTypes);
        Log.severe("Label count histogram: " + labelCountHisto);
        Counter<Integer> slotCountHisto = new ClassicCounter<Integer>();
        for(String e: knownRelationsPerEntity.keySet()) {
          slotCountHisto.incrementCount(knownRelationsPerEntity.get(e).size());
        }
        Log.severe("Slot count histogram: " + slotCountHisto);

        int negativesWithKnownPositivesCount = 0, totalNegatives = 0;
        for(RelationAndMentions rel: relations) {
          if(rel.posLabels.size() == 0) {
            if(knownRelationsPerEntity.get(rel.arg1) != null &&
               knownRelationsPerEntity.get(rel.arg1).size() > 0) {
              negativesWithKnownPositivesCount ++;
            }
            totalNegatives ++;
          }
        }
        Log.severe("Found " + negativesWithKnownPositivesCount + "/" + totalNegatives +
            " negative examples with at least one known relation for arg1.");

        Counter<Integer> mentionCountHisto = new ClassicCounter<Integer>();
        for(RelationAndMentions rel: relations) {
          mentionCountHisto.incrementCount(rel.mentions.size());
          if(rel.mentions.size() > 100)
            Log.fine("Large relation: " + rel.mentions.size() + "\t" + rel.posLabels);
        }
        Log.severe("Mention count histogram: " + mentionCountHisto);

        //
        // Detect the known negatives for each source entity
        //
        if(generateNegativeLabels) {
          for(RelationAndMentions rel: relations) {
            Set<String> negatives = new HashSet<String>(relTypes);
            negatives.removeAll(rel.posLabels);
            rel.negLabels = negatives;
          }
        }

        return relations;
      }

      private static MultiLabelDataset<String, String> toDataset(List<RelationAndMentions> relations) {
        int [][][] data = new int[relations.size()][][];
        Index<String> featureIndex = new HashIndex<String>();
        Index<String> labelIndex = new HashIndex<String>();
        Set<Integer> [] posLabels = ErasureUtils.<Set<Integer> []>uncheckedCast(new Set[relations.size()]);
        Set<Integer> [] negLabels = ErasureUtils.<Set<Integer> []>uncheckedCast(new Set[relations.size()]);

        int offset = 0, posCount = 0;
        for(RelationAndMentions rel: relations) {
          Set<Integer> pos = new HashSet<Integer>();
          Set<Integer> neg = new HashSet<Integer>();
          for(String l: rel.posLabels) {
            pos.add(labelIndex.indexOf(l, true));
          }
          for(String l: rel.negLabels) {
            neg.add(labelIndex.indexOf(l, true));
          }
          posLabels[offset] = pos;
          negLabels[offset] = neg;
          int [][] group = new int[rel.mentions.size()][];
          for(int i = 0; i < rel.mentions.size(); i ++){
            List<String> sfeats = rel.mentions.get(i).features;
            int [] features = new int[sfeats.size()];
            for(int j = 0; j < sfeats.size(); j ++) {
              features[j] = featureIndex.indexOf(sfeats.get(j), true);
            }
            group[i] = features;
          }
          data[offset] = group;
          posCount += posLabels[offset].size();
          offset ++;
        }

        Log.severe("Creating a dataset with " + data.length + " datums, out of which " +
            posCount + " are positive.");
        MultiLabelDataset<String, String> dataset = new MultiLabelDataset<String, String>(
            data, featureIndex, labelIndex, posLabels, negLabels);
        return dataset;
      }

      private static void addKnownRelation(String arg1, String arg2, String label,
          Map<String, Map<String, Set<String>>> knownRelationsPerEntity) {
        Map<String, Set<String>> myRels = knownRelationsPerEntity.get(arg1);
        if(myRels == null) {
          myRels = new HashMap<String, Set<String>>();
          knownRelationsPerEntity.put(arg1, myRels);
        }
        Set<String> mySlots = myRels.get(label);
        if(mySlots == null) {
          mySlots = new HashSet<String>();
          myRels.put(label, mySlots);
        }
        mySlots.add(arg2);
      }
    }

Update: the two points of confusion here were:

  • The root object is a Relation, not a Document (in fact, only Relation and RelationMentionRef are used)
  • The .pb file is actually multiple objects, each varint-delimited, i.e. prefixed by its length expressed as a varint

Because of that, Relation.parseDelimitedFrom should work. Processing the files manually (see the sketch after the counts below), I get:

    test-multiple.pb, 96678 Relation objects parsed
    testNegative.pb, 94917 Relation objects parsed
    testPositive.pb, 1950 Relation objects parsed
    trainNegative.pb, 63596 Relation objects parsed
    trainPositive.pb, 4700 Relation objects parsed
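For reference, here is a minimal Java sketch of that manual count. It assumes the Relation message generated from the .proto lives in the same DocumentProtos class as the Document used above, and that the input file is not gzip-compressed (wrap the stream in a GZIPInputStream first if it is):

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.InputStream;

    import cc.refectorie.proj.relation.protobuf.DocumentProtos.Relation;

    public class CountRelations {
        public static void main(String[] args) throws Exception {
            try (InputStream is = new BufferedInputStream(new FileInputStream(args[0]))) {
                int count = 0;
                // parseDelimitedFrom reads one varint length prefix plus one
                // message per call, and returns null at end of stream.
                while (Relation.parseDelimitedFrom(is) != null) {
                    count++;
                }
                System.out.println(count + " Relation objects parsed");
            }
        }
    }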

Old, outdated exploration:

I pulled down your 4 files and ran them through a small test rig:

  ProcessFile("testNegative.pb"); ProcessFile("testPositive.pb"); ProcessFile("trainNegative.pb"); ProcessFile("trainPositive.pb"); 

where ProcessFile first dumps the first 10 bytes as hex, and then tries to process the file via ProtoReader. Here are the results:

    Processing: testNegative.pb
    dc 16 0a 26 2f 67 75 69 64 2f
    > Document
    Unexpected end-group in source data; this usually means the source data is corrupt

And yes, agreed: DC is wire type 4 (end-group), field 27; your Document does not define a field 27, and even if it did, it would make no sense to start with an end-group.
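To see where that comes from: a protobuf tag packs the field number and the wire type into a single varint, with the wire type in the low 3 bits. A minimal sketch, reading the single byte dc the same way the paragraph above does:

    public class TagDecode {
        public static void main(String[] args) {
            // First byte of testNegative.pb, read as a protobuf tag:
            int b = 0xDC;
            int wireType = b & 0x07;    // low 3 bits -> 4 (end-group)
            int fieldNumber = b >>> 3;  // remaining bits -> 27
            System.out.println("field " + fieldNumber + ", wire type " + wireType);
        }
    }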

    Processing: testPositive.pb
    d5 0f 0a 26 2f 67 75 69 64 2f
    > Document
    250: Fixed32, Unexpected field
    14: Fixed32, Unexpected field
    6: String, Unexpected field
    6: Variant, Unexpected field
    Unexpected end-group in source data; this usually means the source data is corrupt

Here we can't see the offending data in the hex dump, but again: the initial fields look nothing like your data, and the reader readily confirms that the data is corrupt.

    Processing: trainNegative.pb
    d1 09 0a 26 2f 67 75 69 64 2f
    > Document
    154: Fixed64, Unexpected field
    7: Fixed64, Unexpected field
    6: Variant, Unexpected field
    6: Variant, Unexpected field
    Unexpected end-group in source data; this usually means the source data is corrupt

Same as above.

    Processing: trainPositive.pb
    cf 75 0a 26 2f 67 75 69 64 2f
    > Document
    1881: 7, Unexpected field
    Invalid wire-type; this usually means you have over-written a file without truncating or setting the length; see http://stackoverflow.com/q/2152978/23354

CF 75 is a two-byte varint with wire type 7, which is not defined in the spec.
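The same arithmetic, extended over both bytes of the varint (7 payload bits per byte, least-significant group first), reproduces the "1881: 7" line in the dump above. A minimal sketch:

    public class VarintTagDecode {
        public static void main(String[] args) {
            // Two-byte varint cf 75: strip each continuation bit, then combine.
            int tag = (0xCF & 0x7F) | ((0x75 & 0x7F) << 7);  // = 15055
            System.out.println("field " + (tag >>> 3)         // 1881
                    + ", wire type " + (tag & 0x07));         // 7 (invalid)
        }
    }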

Your data is simply garbage, I'm afraid.


And a pass over test-multiple.pb from the comments (after gz decompression):

    Processing: test-multiple.pb
    dc 16 0a 26 2f 67 75 69 64 2f
    > Document
    Unexpected end-group in source data; this usually means the source data is corrupt

This is identical to testNegative.pb, so it fails for exactly the same reason.
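For anyone who wants to replay this exploration, here is a minimal sketch of the first step of that test rig, dumping a file's first 10 bytes as hex (the class name HexHead is hypothetical, not the answerer's actual code):

    import java.io.FileInputStream;
    import java.io.InputStream;

    public class HexHead {
        public static void main(String[] args) throws Exception {
            try (InputStream in = new FileInputStream(args[0])) {
                byte[] head = new byte[10];
                int n = in.read(head);
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < n; i++) {
                    // Mask to 0xff so negative bytes print as two hex digits.
                    sb.append(String.format("%02x ", head[i] & 0xff));
                }
                System.out.println(sb.toString().trim());
            }
        }
    }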

I know it has been more than two years, but here is a general way to read these delimited protocol buffers in Python. The function you mention, parseDelimitedFrom, is not available in the Python implementation of protocol buffers, but here is a small workaround for anyone who may need it. This code is adapted from: https://www.datadoghq.com/blog/engineering/protobuf-parsing-in-python/

    def read_serveral_pbfs(filename, class_of_pb):
        result = []
        with open(filename, 'rb') as f:
            buf = f.read()
            n = 0
            while n < len(buf):
                # Each message is prefixed by its length, encoded as a varint.
                msg_len, new_pos = _DecodeVarint32(buf, n)
                n = new_pos
                msg_buf = buf[n:n + msg_len]
                n += msg_len
                read_data = class_of_pb()
                read_data.ParseFromString(msg_buf)
                result.append(read_data)
        return result

And a usage example with one of the OP's files:

    import Document_pb2
    from google.protobuf.internal.encoder import _VarintBytes
    from google.protobuf.internal.decoder import _DecodeVarint32

    filename = "trainPositive.pb"
    relations = read_serveral_pbfs(filename, Document_pb2.Relation)