I have a question about indexing strategy for my project. I am a novice with Elasticsearch and I need advice about my case.
In the project, the primary database is PostgreSQL. I have about 10,000 objects called "audit"; per day we could have about 100 changes to existing objects and about 30 new objects. Many metadata fields (about 10) should be indexed, and every audit object has some attachments that must also be indexed (we would like to use the ingest attachment plugin).
We don't have a requirement to search in real time.
I am designing a batch job (scheduler) to do bulk indexing every hour.
However, I don't know the best strategy to choose.
Option 1: Bulk insert into a new index. I query the database for the audits and bulk insert them into a new index; by using an alias we can switch to the new index, and I then delete the old one.
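For what it's worth, the alias switch in Option 1 can be done atomically in a single `_aliases` call; a sketch (index names, alias name, and document fields below are placeholders, not from my real project):

```
PUT audits_v2

POST audits_v2/_bulk
{ "index": { "_id": "1" } }
{ "title": "first audit" }

POST _aliases
{
  "actions": [
    { "remove": { "index": "audits_v1", "alias": "audits" } },
    { "add": { "index": "audits_v2", "alias": "audits" } }
  ]
}

DELETE audits_v1
```

Searches always go through the `audits` alias, so clients never see the half-built index.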
Option 2: Bulk update and insert. I query the database for audits with a creation or modification date after the last batch execution, and then execute a bulk insert/update in the same index (or in a new index?). The bulk update approach may be good for not re-indexing attachments when there are no changes (memory cost).
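A sketch of what such an incremental batch could send, using `update` actions with `doc_as_upsert` so that new and changed audits go through the same request (index name, field names, and IDs are placeholders):

```
POST audits/_bulk
{ "update": { "_id": "42" } }
{ "doc": { "title": "updated audit", "modified_at": "2024-05-01T10:00:00Z" }, "doc_as_upsert": true }
{ "update": { "_id": "43" } }
{ "doc": { "title": "new audit", "modified_at": "2024-05-01T10:05:00Z" }, "doc_as_upsert": true }
```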
I also have to find the audits deleted from my primary database so I can delete them from the index.
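If the batch can compute which IDs disappeared from PostgreSQL (for example by diffing the ID lists between runs), those deletions fit in the same bulk API (IDs are placeholders):

```
POST audits/_bulk
{ "delete": { "_id": "42" } }
{ "delete": { "_id": "43" } }
```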
Option 3: Real-time indexing. On every audit change, a message is published to a broker and indexing is executed (maybe grouping messages into bulk requests).
The real-time solution is more difficult to manage and needs a broker.
Given my needs, I think Option 2 is enough, but I don't have enough experience to be sure I won't have performance issues... Maybe there are other options?
As you have few changes/creations per day, around 1% or 2%, I don't think reindexing the full database (Option 1) is the best option here.
I definitely prefer Option 3. This is exactly what I implemented 12 years ago for the same exact use case: I was managing "control data" (similar to your audit data) and attachments.
The reason I prefer Option 3 is that when something is modified in my application, it's basically already in memory. So instead of reading the database again to build the object I want to index, I just have to serialize it as JSON and send it to Elasticsearch.
If you are using Java, you can send your Java bean directly to Elasticsearch using the BulkIngester helper class.
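A minimal sketch of that helper, assuming an already-configured `ElasticsearchClient` named `esClient` and a hypothetical `Audit` bean (the index name and bean are my placeholders):

```java
import co.elastic.clients.elasticsearch._helpers.bulk.BulkIngester;
import java.util.concurrent.TimeUnit;

// The ingester buffers individual operations and flushes them
// to Elasticsearch as bulk requests in the background.
BulkIngester<Void> ingester = BulkIngester.of(b -> b
        .client(esClient)                      // already-configured ElasticsearchClient
        .maxOperations(500)                    // flush after 500 buffered operations...
        .flushInterval(5, TimeUnit.SECONDS));  // ...or every 5 seconds, whichever comes first

// Call this wherever the application saves an audit:
ingester.add(op -> op.index(idx -> idx
        .index("audits")                       // assumed index name
        .id(audit.getId())                     // hypothetical Audit bean
        .document(audit)));                    // the bean is serialized to JSON by the client

// On application shutdown, flush what's left and release resources:
ingester.close();
```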
You don't really need a broker here, as you can send the document to Elasticsearch at the same time as you write it to your database.
If you can't change the application layer, then indeed Option 2 looks better in that case.
The only performance issue I saw in the past was the load on the database; no issue on the Elasticsearch side.
About Option 3: I will also need to load the attachments from S3 storage, because I need to index them.
I don't need a broker, but I will need to relaunch indexing for an audit in case of an Elasticsearch indexing error (network problems, etc.).
A broker (for example Kafka) would help with this, since messages are not committed until indexing succeeds; if I only call Elasticsearch over HTTP (sync or async), I would also need to design a batch job to retry indexing for those audits.
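The "retry for free" idea can be sketched with Kafka's standard consumer API and manual offset commits: if indexing throws, the offset is never committed and the records are re-delivered. Broker address, group id, topic name, and the `indexBulk` helper are all my assumptions:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AuditIndexer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumption
        props.put("group.id", "audit-indexer");             // assumption
        props.put("enable.auto.commit", "false");           // manual commit = retry on failure
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("audit-changes"));   // assumed topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) continue;
                // Hypothetical helper: sends one _bulk request to Elasticsearch.
                // If it throws, commitSync is skipped and the batch is re-delivered.
                indexBulk(records);
                consumer.commitSync();                      // commit only after success
            }
        }
    }

    // Hypothetical helper, implementation not shown.
    static void indexBulk(ConsumerRecords<String, String> records) { /* ... */ }
}
```

This gives at-least-once delivery, so indexing should be idempotent (indexing by audit ID, as above, already is).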