What exactly is the content of an index file in Elasticsearch?


(Daniel C S Yeh) #1

Dear all,

I understand how Elasticsearch indexes a document, as in the example here:

https://www.elastic.co/guide/en/elasticsearch/guide/current/inverted-index.html

But I have one more question: what exactly does an index file contain?

And how can I open it to see the dictionary table?

Many thanks


(Nik Everett) #2

While it doesn't answer all of your questions I like this page:
http://lucene.apache.org/core/4_5_0/core/org/apache/lucene/codecs/lucene45/package-summary.html


(Jason Wee) #3

    // Open the index directory for reading.
    IndexReader indexReader = DirectoryReader.open(FSDirectory.open(new File("clean/index.termrange")));

    // Print the names of all fields in the index.
    SlowCompositeReaderWrapper.wrap(indexReader).getFieldInfos().forEach(x -> System.out.println(x.name));

    // Walk the term dictionary of the "contents" field.
    Terms terms = SlowCompositeReaderWrapper.wrap(indexReader).terms("contents");
    TermsEnum iter = terms.iterator(null);

    BytesRef byteRef = null;
    while ((byteRef = iter.next()) != null) {
        String term = new String(byteRef.bytes, byteRef.offset, byteRef.length);

        // term : number of documents containing it : total occurrences across all documents
        System.out.format("%-10s:%2d:%2d %n", term, indexReader.docFreq(new Term("contents", term)), indexReader.totalTermFreq(new Term("contents", term)));
    }

    // Totals for the whole "contents" field.
    System.out.println(terms.getSumTotalTermFreq());
    System.out.println(terms.getSumDocFreq());
    System.out.println(indexReader.getSumTotalTermFreq("contents"));

Lucene index files are binary, so hexdump cannot print anything useful. You need to write code that reads the index from its directory, then loops through the terms and gets each term's frequencies from the index reader object.
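
To make the two numbers printed in the loop concrete, here is a toy in-memory sketch of a term dictionary with postings, using only the JDK. It is not Lucene's on-disk format, and the class and method names (TermStatsSketch, buildPostings, and so on) are made up for illustration. It only shows what "document frequency" and "total term frequency" mean for a term.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy illustration only: NOT how Lucene lays out its index files.
public class TermStatsSketch {

    // term -> (document id -> number of times the term occurs in that document)
    static SortedMap<String, Map<Integer, Integer>> buildPostings(List<String> docs) {
        SortedMap<String, Map<Integer, Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            // Naive analysis: lowercase and split on non-word characters.
            for (String token : docs.get(docId).toLowerCase().split("\\W+")) {
                if (token.isEmpty()) {
                    continue;
                }
                postings.computeIfAbsent(token, t -> new HashMap<>()).merge(docId, 1, Integer::sum);
            }
        }
        return postings;
    }

    // Number of documents that contain the term at least once.
    static int docFreq(Map<String, Map<Integer, Integer>> postings, String term) {
        Map<Integer, Integer> perDoc = postings.get(term);
        return perDoc == null ? 0 : perDoc.size();
    }

    // Number of occurrences of the term summed over all documents.
    static long totalTermFreq(Map<String, Map<Integer, Integer>> postings, String term) {
        Map<Integer, Integer> perDoc = postings.get(term);
        long total = 0;
        if (perDoc != null) {
            for (int count : perDoc.values()) {
                total += count;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("the quick brown fox", "the lazy dog and the fox");
        SortedMap<String, Map<Integer, Integer>> postings = buildPostings(docs);
        // Same column layout as the loop above: term, docFreq, totalTermFreq.
        for (String term : postings.keySet()) {
            System.out.format("%-10s:%2d:%2d %n", term, docFreq(postings, term), totalTermFreq(postings, term));
        }
    }
}
```

The sorted map stands in for the dictionary table: terms in order, each pointing at the documents that contain it.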

hth

jason


(Colin Goodheart-Smithe) #4

I haven't used it for a while, but Luke used to be a good UI app for inspecting Lucene indices, and it seems to support inspecting Elasticsearch indices now too.

Disclaimer: I have not run Luke since it was hosted on Google Code.

IMPORTANT: I would definitely copy the indices to a new folder (outside of your ES directory) and point Luke at that copy to make sure it doesn't corrupt the index somehow.


(Nik Everett) #5

I don't know if this solution is the "right way to do it" in general, but it will work and teach you some things.

You probably want to be careful of that new String call there - I don't think it'll work properly. I'd go with BytesRef.utf8ToString or UnicodeUtil.UTF8toUTF16.
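
A small JDK-only sketch of why the charset matters, assuming the term bytes are UTF-8 (which is what BytesRef.utf8ToString expects). The three-argument new String(bytes, offset, length) decodes with the platform default charset, so the result depends on the machine. The class name and the sample term below are made up for illustration, and ISO-8859-1 stands in for a non-UTF-8 default.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: plain JDK, no Lucene classes involved.
public class TermDecodingSketch {

    // Decode a slice of a shared buffer as UTF-8, regardless of platform settings.
    static String decodeUtf8(byte[] bytes, int offset, int length) {
        return new String(bytes, offset, length, StandardCharsets.UTF_8);
    }

    // What the same slice looks like when decoded with a single-byte charset.
    static String decodeLatin1(byte[] bytes, int offset, int length) {
        return new String(bytes, offset, length, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Two bytes of unrelated data, then the term "caf\u00e9" encoded as UTF-8.
        byte[] buffer = "xxcaf\u00e9".getBytes(StandardCharsets.UTF_8);
        int offset = 2;
        int length = buffer.length - offset;

        System.out.println(decodeUtf8(buffer, offset, length));   // the original term
        System.out.println(decodeLatin1(buffer, offset, length)); // one char per byte, so the accent is garbled
    }
}
```

With pure ASCII terms both decodings agree, which is why the problem can go unnoticed until a non-ASCII term shows up.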

