Hi!
I'm new to Elasticsearch, but we have been using Lucene for many years and are now moving to Elasticsearch instead.
We index the most important data in an ERP system (items, customers, orders, etc.), which ends up as almost 300 different indices. Each can easily contain 10 million records. With the current implementation we have a problem with certain searches that go across multiple indices, such as: equal to value x in index A, values y+z in index B, and value w in index C. The final result may be only 20 hits that match all of these conditions, but each sub-search can generate a few million hits. Our current implementation supports both this kind of search and a rolled-up hierarchical search (start in index A, take the hits, and move on to search in index B, but only on records matching the first search). This works fine, but not at these volumes per search.

To add to the complexity, the primary keys of the indices are not identical, but parts of the keys appear in all the indices' primary keys. The data is also, for the most part, highly changeable and needs to be synced as close to online as possible (a delay of 30 seconds to 2 minutes is OK). We have tried to explore the path of virtual indices, as well as creating a true index that merges the needed data elements into one and then using delta updates of the index with only the changed columns. Unfortunately, so far we have not managed to get a solution that also performs well enough.
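Conceptually, the hierarchical roll-up search we do today looks something like this minimal Python sketch, with plain dicts standing in for the real indices (all field names and values here are made up for illustration):

```python
# Plain-dict stand-ins for three of our indices; in reality each can hold
# millions of records, and each step below can match millions of hits.
index_a = [{"item": "I1", "x": 1}, {"item": "I2", "x": 1}, {"item": "I3", "x": 2}]
index_b = [{"item": "I1", "y": 5}, {"item": "I2", "y": 9}]
index_c = [{"item": "I1", "w": 7}, {"item": "I3", "w": 7}]

def rollup_search():
    # Step 1: search index A for x == 1 (can be millions of hits in reality).
    keys = {r["item"] for r in index_a if r["x"] == 1}
    # Step 2: search index B, but only on records matching the first search.
    keys = {r["item"] for r in index_b if r["item"] in keys and r["y"] == 5}
    # Step 3: same again against index C; the final result is small.
    return [r for r in index_c if r["item"] in keys and r["w"] == 7]

print(rollup_search())  # -> [{'item': 'I1', 'w': 7}]
```

The pain point is that the intermediate key sets between steps can be huge even when the final result is tiny.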
Does anyone here have an idea of whether and how we can solve this in Elasticsearch?
Is there a reason the data needs to be separated across 300 indices? I.e. could you consolidate into a single (set) of indices?
You'll face similar problems in ES if the setup stays the same. While ES has no problem searching multiple indices and merging the results for display, there's no good way to maintain any type of relation between indices. If you could consolidate into a smaller set of indices (e.g. one customer always ends up in one index, with all the associated types together), you could potentially lean on Parent/Child or Nested mappings.
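For reference, Parent/Child in recent ES versions is modelled with a join field in the mapping, roughly like the sketch below (expressed as Python dicts; the field name `erp_join` and the `customer`/`order` relation are just illustrative):

```python
# Sketch of a join-field mapping (ES 6+): a customer and its orders live in
# the same index, and children are routed to the parent's shard.
join_mapping = {
    "mappings": {
        "properties": {
            "erp_join": {                        # join field name (made up)
                "type": "join",
                "relations": {"customer": "order"},  # customer is parent of order
            }
        }
    }
}

# A child document names its relation and its parent id; it must also be
# indexed with routing set to the parent id so they share a shard.
order_doc = {
    "order_no": 42,
    "erp_join": {"name": "order", "parent": "customer-1"},
}
```

That gives you has_child / has_parent style queries within one index, which is the closest ES gets to a cross-record relation.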
That's not to say you have to put everything in a single, giant index. You could for example use Index Rollover to continually split new indices after n documents. That way your indices keep scaling out as more data comes in.
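A rollover condition on document count might look roughly like this (a sketch of a `_rollover` request body as a Python dict; the alias name and thresholds are made up):

```python
# Sketch of a _rollover request body: roll the write alias over to a new
# index once either condition is met (thresholds are illustrative).
rollover_body = {
    "conditions": {
        "max_docs": 10_000_000,  # cap each index at ~10M docs
        "max_age": "30d",        # optional time-based condition
    }
}
# POST /foo/_rollover with this body would create foo-000002 once
# foo-000001 crosses either threshold.
```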
It does introduce some complexity, however. Say UserA is indexed into foo-1, and at a later time the index is rolled over so you now have foo-1 and foo-2. To make changes to UserA, you'll need to search for the user instead of performing a direct GET (because you don't know which foo index they reside in). And then you'll need to make direct modifications to UserA on that index, as well as any nested or parent/child documents.
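In pseudo-Python, that "find which rolled-over index holds the user, then modify in place" flow looks roughly like this (plain dicts stand in for the indices; all names are made up):

```python
# Stand-ins for two rolled-over indices. In ES you'd search foo-* to find
# the document, then apply a partial update on the concrete index it's in.
indices = {
    "foo-1": {"userA": {"name": "Ann", "tier": "bronze"}},
    "foo-2": {"userB": {"name": "Bob", "tier": "silver"}},
}

def update_user(user_id, changes):
    # Step 1: "search" across all foo indices, since we don't know which
    # one the user landed in before the rollover.
    for index_name, docs in indices.items():
        if user_id in docs:
            # Step 2: partial update directly on that index. Any nested or
            # parent/child documents would be updated here too.
            docs[user_id].update(changes)
            return index_name
    return None

print(update_user("userA", {"tier": "gold"}))  # -> foo-1
```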
But the important bit is that UserA and all its associated documents are colocated in one shard, so that you can maintain relations.
Otherwise I think you'll end up with a similar scrolling search, rolling up as you go.
We have been thinking about merging the related information into a larger type of index. There are multiple issues that, at least for now, have prevented us from doing that. To give a concrete example: items in the backend system are separated into multiple files depending on the type of data you maintain and how it is distributed based on various keys. At the highest level, you maintain data at the item-number level. The next level is the item/warehouse level, where you maintain information unique to that level that is not applicable higher up. Thirdly, you maintain data at the item/geographical level, valid only per item and region (such as country). Fourthly, we can add the item/language level.

Creating a single index from these 4 levels is not easy, since you have to find the lowest level of keys and distribute information from the higher to the lower levels of the key set. Even if we did succeed initially, you have the problem of keeping the different columns in sync, where, for instance, a change at the item level must be distributed to every record for the same item number. Since it is very expensive to retrieve all the information at every change, it has to be a delta update of just the affected elements in the index.
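To make the sync problem concrete, here is a rough Python sketch of the delta update we would need: a change at the item level has to fan out to every merged record for that item number, touching only the changed columns rather than re-indexing full documents (all field names and values are made up):

```python
# Merged index: one record per lowest-level key (item + warehouse here),
# with item-level columns denormalised onto every record.
merged = [
    {"item": "A1", "warehouse": "W1", "desc": "bolt", "stock": 10},
    {"item": "A1", "warehouse": "W2", "desc": "bolt", "stock": 3},
    {"item": "B2", "warehouse": "W1", "desc": "nut", "stock": 7},
]

def apply_item_delta(item_no, changed_columns):
    """Fan an item-level change out to all records for that item number,
    updating only the changed columns (a delta, not a full re-index)."""
    touched = 0
    for record in merged:
        if record["item"] == item_no:
            record.update(changed_columns)
            touched += 1
    return touched

apply_item_delta("A1", {"desc": "hex bolt"})  # touches 2 of 3 records
```

The same fan-out applies at each of the 4 levels, which is why keeping a single merged index in near-online sync has been hard for us.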
Hopefully this gives you a bit more info as to why we haven't been able to solve it yet.