Force Elasticsearch to Separate Lucene Indices with the same Field Name

(Harlin) #1

I have two types of data within each index, but each type has the same field names because I want to be able to search both types at once on the same field. That said, I also want to be able to search them separately, as the volume of one type is far greater than the other. I want searches on only the less voluminous type to return quickly, whereas searches on both types, or on the more voluminous type, will take longer and return more hits.
Unfortunately, Elasticsearch will always index values with the same field name into the same Lucene inverted index, whether or not the values come from separate types. Therefore, if I search only the smaller type, I will still have to search through all of the data in the index to find my hits.

For example, if I have type RARE with fields "user" and "name",
as well as type COMMON with the same fields "user" and "name",
and I then search only type RARE on field "user", Lucene will still have to search through all the COMMON data to find its hits for type RARE.

I was wondering if there is a way to separate these indices in Lucene without having to change the field name. Or can I alias a field somehow so that I don't actually have to change my query?


(Zachary Tong) #2

Kinda sorta. When you search for the RARE type, internally ES is building a filtered query which is filtering on the _type field. So what's happening is that Lucene is masking off a portion of the index and searching only that.
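As a rough illustration of that masking, here is a sketch of a request body that is approximately equivalent to searching the RARE type. It assumes the 1.x-style `filtered` query and `type` filter; the helper name `type_filtered_search` and the search term "harlin" are made up for the example, and the query ES builds internally may differ in detail.

```python
import json


def type_filtered_search(doc_type, field, text):
    """Build a 1.x-style search body that matches `text` on `field`,
    restricted to a single mapping type via a filter on _type."""
    return {
        "query": {
            "filtered": {
                # The scoring part: an ordinary match query on the shared field.
                "query": {"match": {field: text}},
                # The masking part: only documents of this type pass the filter.
                "filter": {"type": {"value": doc_type}},
            }
        }
    }


# Hypothetical search for "harlin" in the "user" field of type RARE.
body = type_filtered_search("RARE", "user", "harlin")
print(json.dumps(body, indent=2))
```

The "user" terms from COMMON documents still live in the same inverted index; the filter only decides which matching documents are kept.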

That said, you are correct this can be suboptimal, since the filter in the COMMON case is relatively dense. This is one of the reasons types and mappings were cleaned up for 2.0, to make this relationship more clear.

The solution to your question is pretty simple: create two separate indices for RARE and COMMON. You can still search both simultaneously (/RARE,COMMON/my_type/_search) or you can search them individually. You could use an alias to make it appear as a single index for convenience (/COMBINED/my_type/_search).
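A minimal sketch of the requests involved, built as plain Python data so that nothing here talks to a cluster. The index names are written in lowercase because Elasticsearch requires lowercase index names; `rare`, `common`, `combined`, and `my_type` are placeholders, and the two helper functions are hypothetical.

```python
import json


def alias_actions(alias, indices):
    """Body for POST /_aliases that points one alias at several indices."""
    return {"actions": [{"add": {"index": name, "alias": alias}} for name in indices]}


def search_path(targets, doc_type=None):
    """Build a search URL path for one or more indices (or aliases)."""
    parts = [",".join(targets)]
    if doc_type:
        parts.append(doc_type)
    parts.append("_search")
    return "/" + "/".join(parts)


indices = ["rare", "common"]

# One-time setup: make both indices reachable through a single alias.
print("POST /_aliases")
print(json.dumps(alias_actions("combined", indices), indent=2))

# Search only the small index, both indices explicitly, or both via the alias.
print(search_path(["rare"], "my_type"))
print(search_path(indices, "my_type"))
print(search_path(["combined"], "my_type"))
```

Because the alias resolves to both indices, existing queries against the combined name should not need to change.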

But as individual indices their filter caches are local to the data and don't suffer from the "other" data. And you can shard each differently for performance reasons, e.g. COMMON can have more shards than RARE because it is larger. It also can help with scoring, since the term/doc frequencies won't "bleed" over into the other type when searching independently.
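To make the scoring point concrete, here is a toy calculation using the idf formula from Lucene's classic TF-IDF similarity, `1 + ln(numDocs / (docFreq + 1))`. The document counts and term frequencies are invented for illustration, and other similarities (BM25, for instance) use a different formula, so treat this as a sketch of the effect rather than the score ES would actually return.

```python
import math


def classic_idf(doc_freq, num_docs):
    """Inverse document frequency as in Lucene's classic TF-IDF similarity."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Invented numbers: a small RARE data set and a much larger COMMON one.
rare_docs, common_docs = 1_000, 1_000_000

# A term that is unusual among RARE documents but very frequent in COMMON.
rare_df, common_df = 10, 500_000

# RARE in its own index: statistics come from RARE documents only.
idf_separate = classic_idf(rare_df, rare_docs)

# RARE and COMMON in one index: statistics are shared across both types.
idf_shared = classic_idf(rare_df + common_df, rare_docs + common_docs)

print(f"idf with a separate RARE index: {idf_separate:.3f}")
print(f"idf with a shared index:        {idf_shared:.3f}")
```

With these numbers the term counts as far more distinctive when RARE has its own index, which is the "bleed" described above.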
