I want to get all my data (logs) out of Elasticsearch with the elastic package for R. I use the scroll API (size 10000), but it takes forever (9 minutes) to get 2.5 million documents. I have one node with an SSD and 16 GB RAM, 8 GB of which are reserved for Elasticsearch. Indices are created on a monthly basis with one shard per index.
CPU usage is around 20% and heap usage is between 3 and 4 GB.
Any ideas what the problem could be, or is this normal?
What does your scroll query look like? What is the size of your documents?
My query is quite long:
NOT uriStem: images AND NOT uriStem: includes AND NOT uriStem: favicon.ico AND NOT uriStem:style AND NOT uriStem: *sta AND NOT uriStem: *png AND NOT uriStem: *zip AND NOT uriStem: *txt AND NOT uriStem: *csv AND NOT uriStem: pdf AND NOT userAgent: www.bla.com AND NOT userAgent: www.blub.com AND NOT userAgent: www.bing.com AND NOT userAgent: www.baidu.com AND NOT uriStem:robots.txt AND requestHost: tada AND NOT uriStem: test AND NOT uriStem: leer.asp AND NOT leer.htm AND NOT uriStem: portal.asp AND NOT uriStem.keyword:"/" AND NOT uriStem: main.asp AND NOT uriStem: Popup AND NOT uriStem: Calendar.asp
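For reference, the same kind of exclusion list can be expressed as a `bool` query in the query DSL, with the clauses in filter context so no relevance scoring is done. A shortened sketch covering only a few of the clauses above (exact clause types depend on how the fields are mapped):

```json
{
  "query": {
    "bool": {
      "filter": [
        { "match": { "requestHost": "tada" } }
      ],
      "must_not": [
        { "match": { "uriStem": "images" } },
        { "wildcard": { "uriStem": "*png" } },
        { "match": { "userAgent": "www.bing.com" } }
      ]
    }
  }
}
```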
But it makes no difference if my query is just: serverName: blub
The REST parameters for the scroll are: scroll = "1m", size = 10000.
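Over the REST API, those parameters correspond to one initial search followed by repeated scroll calls until no more hits come back. A sketch (the index name is a placeholder):

```
POST /logs-2016.01/_search?scroll=1m
{ "size": 10000, "query": { ... } }

POST /_search/scroll
{ "scroll": "1m", "scroll_id": "<_scroll_id from the previous response>" }
```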
The size of my documents is:
Leading wildcard queries are very, very inefficient, so several of these combined with all the NOT clauses explain your poor performance. If these criteria are known upfront, could you perhaps analyse each record with respect to these at ingest time and add a simple flag in order to simplify this and make it much more efficient?
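As an illustration of the ingest-time flag idea, a Logstash conditional could tag matching events as they come in, so the query collapses to a single term filter (a sketch; the `excluded` field name and the patterns shown are illustrative, not the full criteria list):

```
filter {
  if [uriStem] =~ /\.(png|zip|txt|csv)$/ or [uriStem] in ["images", "includes", "favicon.ico"] {
    mutate { add_field => { "excluded" => "true" } }
  }
}
```

The query then becomes something like `NOT excluded: true AND requestHost: tada`, which needs no wildcards at all.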
Yeah, I also thought the wildcards were the problem, but why do I get similar performance with a query like this:
It takes the same time with the same result size.
You mean flag them while processing in Logstash? Sadly that's not an option, because those parameters can change over time.
This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.