Just a heads-up: this is my first post on this forum.
I'm a data scientist at a start-up, and we use Elasticsearch to store and query our data. We have used ES with much success over the last year. However, we do run into some problems, and since we are far from Elasticsearch experts, we would like to ask for help and see whether and how these problems can be resolved. If these questions are better suited elsewhere, please let me know. The three main things we are struggling with are the following:
When querying Arabic text (a language written right-to-left rather than left-to-right), we retrieve 'surprising' results. How could we deal with this, or change the indexing for it?
In our case, fast retrieval of the data matters most, and we currently do not write much data into ES. Ideally, we would either do one big bulk insert every week or do small inserts whenever somebody changes something in the table. How can we tune the JVM for this workload?
Lastly, in some cases we retrieve a lot of records (100K+), and scrolling then becomes really slow. Are there workarounds or solutions for this?
Please let me know if anything is unclear or if there are any questions. Thanks in advance!
Can you please take some more time to properly explain the problem? What does surprising mean to you? What do you expect vs. what is returned? A fully reproducible example might help a lot in this case.
Can you explain why tweaking the JVM is your first idea here? Have you considered tweaking the index operations first? Again, adding some more context is really important. How much performance are you missing, when doing that one big index?
Again, adding context and examples is crucial. How are you executing your scrolls, with which size, when does it become slow, and how many shards are you querying? Which Elasticsearch version are you on?
Without context, everything here becomes guesswork, which greatly reduces your chances of getting an answer.
مدير التسويق على شبكة الإنترنت
(Digital marketing specialist)
(SEO Operations Manager)
(Web Marketing Director)
When the Arabic search text contains one or more spaces, the ES query, instead of returning only the matches, starts retrieving the whole database. If we search for just one Arabic word, everything works fine.
Yeah, apologies for how I phrased this point. JVM tweaking is an area where I have little experience or knowledge, but I know it helps with "memory management", and since scrolling can be memory intensive (along with the other tasks), it occurred to me that it could help in that respect. Any tips, references, or links would be helpful and much appreciated.
Please be more specific. You still need to provide queries and reproducible examples, not just a description in text, so that others can understand the issue. Also, how did you figure out that 'the whole database is retrieved'? Can you explain what makes you think that? (I have no idea how I would validate that claim, but maybe you found some evidence, so I am interested in it.) Also, please paste the exact query you are testing with. This is what I meant by context, as it helps to understand what you are after.
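For illustration, a reproducible example can be as small as the request body that gets sent to the search endpoint. The sketch below is only an assumption about what such a query might look like: the field name `job_title` is hypothetical, and a `match` query with `operator` set to `and` is just one way to search a multi-word phrase, not necessarily the query used in this thread. By default a `match` query combines the analyzed terms with `or`, so a multi-word search returns documents containing any one of the words.

```python
import json


def build_match_query(field, text, operator="and"):
    """Build a search body for a full-text `match` query.

    With operator="and", every analyzed term has to match. The
    Elasticsearch default ("or") matches documents containing any term.
    """
    return {
        "query": {
            "match": {
                field: {"query": text, "operator": operator}
            }
        }
    }


# "job_title" is a hypothetical field name, used for illustration only.
body = build_match_query("job_title", "مدير التسويق على شبكة الإنترنت")

# Printing the exact JSON makes it easy to paste into a forum post.
print(json.dumps(body, ensure_ascii=False, indent=2))
```

Posting the printed JSON together with a couple of sample documents and the unexpected hits is usually enough for others to reproduce the behaviour.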
My advice here: do not tune any GC parameters unless you fully understand what they are for. Configuring the heap size is fine. But it is extremely rare for performance issues to be caused by the default GC configuration. I know there is a lot of content out there on the internet, but be careful with it.
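As a rough sketch of the heap-size part only: a common rule of thumb is to give Elasticsearch about half of the machine's RAM, with the minimum and maximum heap set to the same value and kept below roughly 32 GB. The helper name `heap_options` and the 31 GB cap are assumptions made for this example, not values taken from this thread, so check the documentation for your version before applying them.

```python
def heap_options(total_ram_gb, max_heap_gb=31):
    """Return JVM heap flags following the 'half of RAM, capped' rule of thumb.

    max_heap_gb=31 is an assumed cap that stays below the ~32 GB mark.
    """
    heap_gb = max(1, min(int(total_ram_gb // 2), max_heap_gb))
    # Initial and maximum heap are set to the same size.
    return [f"-Xms{heap_gb}g", f"-Xmx{heap_gb}g"]


# Example: a machine with 16 GB of RAM.
for line in heap_options(16):
    print(line)
```

The two printed flags are the kind of lines that go into the JVM options file of the node. Everything else in that file, GC flags included, can stay at its default.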
The same applies as for the first point: more context is needed. From what I see in that data, you may want to split your shards into smaller ones, as you now have a single shard of 190 GB. That said, this does not explain why your scroll search is getting slower. Can you also provide the initial query that you used with the scroll search? Are you trying to do a full export of your data? Also, what order of magnitude of slow are we talking about? How fast is it at the beginning, and how does it degrade? Does it hang, or is it still running, albeit slowly?
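If the goal turns out to be a full export, one alternative to scrolling that is often suggested is paging with `search_after`. The sketch below shows the idea under stated assumptions: `iter_hits` and `fake_search` are hypothetical helpers, `record_id` is a hypothetical field with unique, sortable values, and `fake_search` only stands in for a real client call so that the example runs without a cluster.

```python
def iter_hits(search_fn, query, sort_field, page_size=1000):
    """Yield every hit by paging with `search_after`.

    search_fn takes a request body and returns a search response dict.
    sort_field should hold unique values, otherwise hits can be skipped.
    """
    body = {"size": page_size, "query": query, "sort": [{sort_field: "asc"}]}
    while True:
        hits = search_fn(body)["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # Resume after the sort values of the last hit on this page.
        body["search_after"] = hits[-1]["sort"]


def fake_search(body, _docs=tuple(range(1, 8))):
    """Stand-in for a real search call, backed by seven in-memory documents."""
    after = body.get("search_after", [0])[0]
    page = [d for d in _docs if d > after][: body["size"]]
    return {
        "hits": {
            "hits": [{"_source": {"record_id": d}, "sort": [d]} for d in page]
        }
    }


hits = iter_hits(fake_search, {"match_all": {}}, "record_id", page_size=3)
ids = [h["_source"]["record_id"] for h in hits]
print(ids)  # [1, 2, 3, 4, 5, 6, 7]
```

With a real cluster, `search_fn` would be a thin wrapper around the client's search call. Whether this is faster than scroll for a given data set is something to measure rather than assume.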
Also, can you explain your node setup? You only seem to have two shards, so maybe not all of your nodes in the cluster are getting utilized properly - or is this a single node setup?
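To put a number on the shard-size remark, here is a back-of-the-envelope sketch. The 50 GB target per shard is an assumption based on a commonly quoted guideline, not a figure from this thread, and `primary_shards_needed` is a hypothetical helper.

```python
import math


def primary_shards_needed(data_size_gb, target_shard_gb=50):
    """Smallest number of primary shards that keeps each at or below the target."""
    return max(1, math.ceil(data_size_gb / target_shard_gb))


# The 190 GB shard mentioned above, split at an assumed 50 GB target.
print(primary_shards_needed(190))  # 4
```

More shards also make it possible to spread the data across several nodes, which only helps if the cluster actually has more than one data node.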
We had a bug in the code that parses the query, which produced faulty results. We fixed it, and now everything runs smoothly. Your insistence on specifics is what led us to the problem.
We increased the heap size, and results were already much better. And also, the results weren't as bad as we first thought.