Greetings! I'm trying to implement a custom mapping where a byte array
field is stored (but not indexed for searching), and the byte array is
available as side data to a custom filter. However, I run into two
challenges:
Problem 1: Storing a byte array
Setting .store(true) on a BinaryFieldMapper eventually leads to an
exception down the stack:
Fields with BytesRef values cannot be indexed
    at org.apache.lucene.document.Field.&lt;init&gt;(Field.java:222)
(I am coding against 0.20.1)
Problem 2: Reading a stored field
It seems that the FieldDataCache (I see it used in many of Elasticsearch's
filter implementations) cannot read stored-but-not-indexed fields; it only
reads indexed fields. What's the best way around this (assuming I can
store the field)?
I can work around both of these by base-64 encoding the bytes and using a
StringFieldMapper. But this uses much more space than necessary and also
takes a performance hit on an operation that will be iterated hundreds of
thousands of times, so it is unacceptable as a solution.
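For concreteness, the base-64 overhead is easy to quantify: the encoding emits 4 output characters for every 3 input bytes, so the stored value grows by roughly a third before any per-field overhead. A stand-alone sketch (the 3,000-byte size is hypothetical, and `java.util.Base64` is used purely for illustration):

```java
import java.util.Base64;

public class Base64Overhead {
    public static void main(String[] args) {
        // A hypothetical 3,000-byte polygon serialization.
        byte[] shape = new byte[3000];

        String encoded = Base64.getEncoder().encodeToString(shape);

        // Base-64 emits 4 output characters for every 3 input bytes,
        // so the stored value grows by roughly a third.
        System.out.println(shape.length);     // 3000
        System.out.println(encoded.length()); // 4000
    }
}
```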
Alternatively, since the byte array will never be used for querying, I
assume that storing it is preferable to indexing it, but I'm not clear on
whether stored vs. indexed fields are written to different files, or
precisely what the tradeoff between the two is.
Background and context: This is to fix a bug in the geo_shape filter; the
byte array field is a serialization of a polygon, and the custom filter is
the geo_shape filter. Details at
I'm out of ideas, so any insight is very much appreciated!
Yes, it seems there is a missing part.
org.elasticsearch.index.field.data.bytes.ByteFieldDataType always looks
into the term list for indexed terms to load into the cache, because it
uses FieldDataLoader, which does not check store-only fields. I'm
confident this will be improved, especially with Lucene 4 now arriving,
which brings improvements for filling field caches.
This is great information Jörg -- thanks. It looks like the
FieldDataLoader could be tweaked to also load stored fields by calling
reader.document(docId) for each doc, but I don't trust my shallow
understanding of the Elasticsearch codebase to make that change myself.
It seems that, for the time being, storing a byte array is not a possible
solution. What is the reason not to actually index (not store) the byte
array via ByteFieldMapper, and then use the FieldDataCache? (Keeping in
mind that each byte array might be a couple of kilobytes in size.)
Or, is there an alternative way to accomplish my goal of checking a byte
array inside a Filter implementation?
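For what it's worth, here is the kind of decoding a custom filter would do once it can reach the bytes. This is a minimal, self-contained sketch, not the actual geo_shape serialization: the wire format (a vertex count followed by pairs of doubles) and the class name are invented for illustration.

```java
import java.nio.ByteBuffer;

public class PolygonCodec {
    // Pack a polygon as [vertexCount][x0][y0][x1][y1]...
    // This wire format is purely illustrative.
    public static byte[] encode(double[][] vertices) {
        ByteBuffer buf = ByteBuffer.allocate(4 + vertices.length * 16);
        buf.putInt(vertices.length);
        for (double[] v : vertices) {
            buf.putDouble(v[0]);
            buf.putDouble(v[1]);
        }
        return buf.array();
    }

    // Reverse the packing: read the count, then the coordinate pairs.
    public static double[][] decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        double[][] vertices = new double[buf.getInt()][2];
        for (double[] v : vertices) {
            v[0] = buf.getDouble();
            v[1] = buf.getDouble();
        }
        return vertices;
    }

    public static void main(String[] args) {
        double[][] triangle = {{0, 0}, {10, 0}, {5, 5}};
        byte[] packed = encode(triangle);
        double[][] roundTrip = decode(packed);
        System.out.println(packed.length);   // 4 + 3*16 = 52
        System.out.println(roundTrip[2][1]); // 5.0
    }
}
```

A couple of kilobytes per shape corresponds to a few dozen to a couple hundred vertices under a scheme like this, which is why avoiding a base-64 round trip on every filtered document matters.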
Thanks!
On Thursday, December 13, 2012 2:21:57 AM UTC-8, Jörg Prante wrote:
Yes, it seems there is a missing part.
org.elasticsearch.index.field.data.bytes.ByteFieldDataType always looks
into the term list for indexed terms to load into the cache, because it
uses FieldDataLoader, which does not check store-only fields. I'm
confident this will be improved, especially with Lucene 4 now arriving,
which brings improvements for filling field caches.
Jeffrey, I can only make wild guesses, kimchy and David Smiley are doing
the hard work and they are much more familiar with the code. I only worked
through the geoshape code, trying to understand, and I'm still learning.
To your question: maybe it has not been considered because geoshapes are
the first feature to require this implementation path?
I totally agree it's not easy to make design decisions and change the code
to add new implementation paths, but I'm sure if you can find a geoshape
filter cache solution that is useful and is easy to use, it will be more
than welcome!
Thanks, Jörg. For now we will have to post-process documents returned from
Elasticsearch to filter out false positives (at the cost of considerable
added latency, largely because it requires parsing the GeoJSON). I presume
that storing byte arrays and filtering on stored fields will become useful
for totally different reasons in Elasticsearch beyond geo and hope that the
Elasticsearch API will soon expose this new Lucene 4.0 feature (at which
point I will try again to contribute the more efficient filtering solution
back to ES core!).
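As an illustration of what that client-side post-processing amounts to: once candidate documents come back, each shape must pass a real geometry test to discard false positives. The sketch below uses a standard ray-casting point-in-polygon check as a stand-in (actual geo_shape filtering compares shapes rather than single points, and the class name is invented):

```java
public class PostFilter {
    // Standard ray-casting point-in-polygon test: cast a ray from
    // (x, y) and count how many polygon edges it crosses; an odd
    // count means the point is inside.
    public static boolean contains(double[][] polygon, double x, double y) {
        boolean inside = false;
        for (int i = 0, j = polygon.length - 1; i < polygon.length; j = i++) {
            double xi = polygon[i][0], yi = polygon[i][1];
            double xj = polygon[j][0], yj = polygon[j][1];
            // Does the horizontal ray at height y cross edge (j -> i)?
            boolean crosses = (yi > y) != (yj > y)
                && x < (xj - xi) * (y - yi) / (yj - yi) + xi;
            if (crosses) inside = !inside;
        }
        return inside;
    }

    public static void main(String[] args) {
        double[][] square = {{0, 0}, {10, 0}, {10, 10}, {0, 10}};
        System.out.println(contains(square, 5, 5));  // true
        System.out.println(contains(square, 15, 5)); // false
    }
}
```

Running a check like this over every candidate is cheap once the geometry is decoded; the latency cost in our case comes from having to parse GeoJSON for each document first, which is exactly what direct access to a stored byte array would avoid.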
On Thursday, December 13, 2012 3:04:12 PM UTC-8, Jörg Prante wrote:
Jeffrey, I can only make wild guesses, kimchy and David Smiley are doing
the hard work and they are much more familiar with the code. I only worked
through the geoshape code, trying to understand, and I'm still learning.
To your question: maybe it has not been considered because geoshapes are
the first feature to require this implementation path?
I totally agree it's not easy to make design decisions and change the code
to add new implementation paths, but I'm sure if you can find a geoshape
filter cache solution that is useful and is easy to use, it will be more
than welcome!