Hello! I'm currently developing an autocomplete feature for my mobile marketplace, based on the products dataset stored in my main database.
I currently have about 100K documents loaded into my products index using Logstash. The products are managed by hundreds of vendors, so every day I have a lot of additions and removals in my database that I need to sync with App Search. Updating the index on a daily basis seems fine to me.
I did some research on how to keep indexes updated in ES and found two different approaches. Since a massive deletion with Logstash seems to be resource-intensive, one approach is to create a new index every day and drop the old one; the other is to use the new index lifecycle management feature.
That may be fine when working directly with ES, but I found that App Search creates many indexes with custom names. I'm pretty sure that if I try to manipulate those indexes directly in ES, I will break App Search.
On the other hand, I didn't find anything about index management in App Search.
What strategy do you recommend to keep the indexes updated on a daily basis?
Just in case: I'm running the whole ELK stack in my own Docker environment.
Just so I understand your question correctly: you have a set of products which are added, updated, or deleted daily, and you’re wondering how to manage those indexes. Are you directly ingesting those documents into App Search using the App Search Logstash Plugin, or directly into Elasticsearch?
I ask because index management for the actual document stores should be pretty hands-off with App Search. You certainly shouldn’t have to cycle through indexes for your documents. You simply add, update, or remove documents using the documents endpoint of the App Search API. You can do that daily, or even as the documents are changed in your main database.
You can batch documents in sets of up to 100 per API call.
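For instance, here's a minimal sketch of an indexing call against the documents endpoint in plain Java. The host, engine name, API key, and document fields are all placeholders for your own values:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AppSearchIndexer {
    // Placeholder values -- substitute your own host, engine, and private key.
    private static final String ENDPOINT =
        "http://localhost:3002/api/as/v1/engines/products/documents";
    private static final String API_KEY = "private-xxxxxxxxxxxx";

    public static void main(String[] args) throws Exception {
        // A batch of documents as a JSON array; up to 100 documents per call.
        String batch = "["
            + "{\"id\":\"sku-123\",\"name\":\"Blue T-Shirt\",\"price\":19.99},"
            + "{\"id\":\"sku-456\",\"name\":\"Red Hoodie\",\"price\":39.99}"
            + "]";

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(ENDPOINT))
            .header("Content-Type", "application/json")
            .header("Authorization", "Bearer " + API_KEY)
            .POST(HttpRequest.BodyPublishers.ofString(batch))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());

        // The response reports, per document, its id and any indexing errors.
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

Posting a document with an id that already exists updates it, so the same call covers both additions and changes.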
Having said that, App Search users do need to manually manage the logging and analytics indexes for now — I’m happy to dive into more details if that was the intent of your initial question, but I’d like to make sure we’ve sorted out your document ingestion first.
Hello Nick, thank you for taking the time to help me. I also apologize in advance for any mistakes in my written English, since it's not my native language.
I'm currently ingesting the data only through Logstash, using the odbc input and App Search output plugins. I'd like to keep the interface between my database and AS/ES as simple as possible, which is why I was planning a Logstash-only process, without having to make additional API calls outside Logstash.
Besides, I'm developing in Java, and sadly the Java client was deprecated, so I'm writing my own.
Since Logstash allows me to insert and update records but not delete them, I thought of a few possibilities I want to share with you, ordered by convenience:
1. My original idea: ingest normally through Logstash and do a daily index wipe or cycle using ES's index lifecycle management or a similar built-in solution provided by AS/ES.
2. Enable Logstash to delete specific records, using an odbc input with its own statement plus the App Search output plugin, maybe in another pipeline, separate from the ingestion one.
3. Wipe the entire engine content with Logstash prior to ingesting the whole index data.
4. Add an extra column to the index to identify the deleted records, and enable Logstash to somehow delete all of them (see the rough sketch after this list).
5. Invoke a method in the App Search API to wipe out all the engine's content.
6. Add an extra column to the index to identify the deleted records and use it to filter the queries. I don't like this option, since I would keep storing the deleted products, with a performance and storage penalty.
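If I end up having to step outside Logstash for deletions only (options 4 and 5), I imagine a small Java sketch like this, using the same documents endpoint; the host, engine name, key, and ids are placeholders, and I haven't verified this against my setup:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.stream.Collectors;

public class AppSearchCleaner {
    // Placeholder values for my environment.
    private static final String ENGINE_BASE =
        "http://localhost:3002/api/as/v1/engines/products";
    private static final String API_KEY = "private-xxxxxxxxxxxx";

    // Option 4 outside Logstash: delete specific documents by id.
    // (Option 5 would instead be a DELETE on the engine itself, ENGINE_BASE,
    // followed by recreating it before the next full ingestion.)
    static void deleteDocuments(List<String> ids) throws Exception {
        String body = ids.stream()
            .map(id -> "\"" + id + "\"")
            .collect(Collectors.joining(",", "[", "]"));

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(ENGINE_BASE + "/documents"))
            .header("Content-Type", "application/json")
            .header("Authorization", "Bearer " + API_KEY)
            .method("DELETE", HttpRequest.BodyPublishers.ofString(body))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }

    public static void main(String[] args) throws Exception {
        // Ids flagged as deleted in my database (hypothetical values).
        deleteDocuments(List.of("sku-123", "sku-456"));
    }
}
```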
How are you updating your database to begin with? You might be able to keep things simple and skip a step by making a call to App Search at the same time you/your users make changes to your database.
So rather than:
User -> Database -> Logstash -> App Search
You could have two calls on every update: one for the database, and one for App Search.
User -> Database + App Search.
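Sketched in Java, with every type name here being a hypothetical stand-in rather than anything App Search ships, that could look like:

```java
// Hypothetical stand-ins for your persistence layer and your own App Search client.
interface ProductRepository { void save(Product p); void delete(String id); }
interface AppSearchClient { void index(Product p); void delete(String id); }
record Product(String id, String name, double price) {}

public class ProductService {
    private final ProductRepository repository; // your main database
    private final AppSearchClient appSearch;    // wraps the App Search documents endpoint

    public ProductService(ProductRepository repository, AppSearchClient appSearch) {
        this.repository = repository;
        this.appSearch = appSearch;
    }

    public void saveProduct(Product product) {
        repository.save(product);  // call one: the database
        appSearch.index(product);  // call two: App Search stays in sync
    }

    public void deleteProduct(String id) {
        repository.delete(id);
        appSearch.delete(id);
    }
}
```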
I prefer using Logstash because it lets me decouple my app from the data ingestion of Elasticsearch/App Search. My API doesn't know that App Search needs to be updated; it just makes queries. Besides, there are some recurrent calculations I need to upload into App Search that depend on a schedule and not on user actions. Currently I have two pipelines with different queries and frequencies to handle those scenarios.
If I were to discard Logstash and make direct API calls, besides needing a scheduler, I would have to use some decoupling mechanism between my API and App Search's API, like message queues, and do all that programming myself. Since I'm the only programmer, Logstash seems far easier to get working and to maintain. In fact, that's the same reason I chose App Search instead of Elasticsearch.