Search performance of index with large documents (PDFs)


(Petter) #1

Hello!

I'm new to Elasticsearch and would like some help with tuning and tips on how to increase the performance of my index.

Currently I have around 4500 documents in the index, which takes about 34GB on disk and consists of PDFs with some metadata. The PDFs are indexed using the Mapper Attachment plugin and are from 10MB to 150MB each; a few are bigger, up to 250MB.

My problem is that search operations take a long time, sometimes several seconds. I'm filtering on 0 to 7 fields and sorting on 2, plus running a Query String query against the document text (base64 encoded), title, and some other meta fields. I'm also paginating all of the results (up to 450 pages of 10 documents each) and using highlighting to show which part was hit. I guess this is part of my problem, but I can't really get away from it.
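
For illustration only, a request of the shape described above could be sketched as follows. This assumes the Elasticsearch 1.x `filtered` query syntax; the field names (`file`, `title`, `category`, `created`) and the helper `build_search_body` are hypothetical and not taken from the actual mapping.

```python
import json


def build_search_body(text, filters, page, page_size=10):
    """Build an Elasticsearch 1.x style search body (illustrative sketch only)."""
    # Hypothetical fields: "file" stands in for the attachment text field.
    query = {"query_string": {"query": text, "fields": ["file", "title"]}}
    if filters:
        # One term filter per metadata field; all of them must match.
        term_filters = [
            {"term": {field: value}} for field, value in sorted(filters.items())
        ]
        query = {
            "filtered": {
                "query": query,
                "filter": {"bool": {"must": term_filters}},
            }
        }
    return {
        "query": query,
        # Sorting on two (hypothetical) fields.
        "sort": [{"created": "desc"}, {"category": "asc"}],
        # Pagination: page numbers start at 1.
        "from": (page - 1) * page_size,
        "size": page_size,
        # Highlight hits in the document text.
        "highlight": {"fields": {"file": {}}},
    }


body = build_search_body("pump maintenance", {"category": "manual"}, page=450)
print(json.dumps(body, indent=2))
```

Note that the last of 450 pages asks Elasticsearch to skip 4490 hits before returning 10, and highlighting has to work on the text of every returned document.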

The server has 8GB of RAM and Elasticsearch has ES_HEAP_SIZE set to 2GB. I'm guessing this is the other part of my problem and that the bottleneck is here, right? I don't know how much I can increase it, since the server is running a web server as well. The server can of course be upgraded.

I haven't changed any shard settings from the default values. It's currently hosted in Azure, but I don't know right now whether I have SSDs or spinning disks.

I can post my mappings if that helps.

Happy for any input and explanation of why it's going slow. I'm not surprised that it does, but I would like to understand why :slightly_smiling:


(Petter) #2

Bump. Any piece of help would be awesome!


(Adrien Grand) #3

Can you show us what a typical query looks like, and try to run the nodes hot threads API while search requests are being performed so that we can get an idea of the bottleneck? On Elasticsearch 2.2, you could also use the query profiler to identify the slow bits of your queries.
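
A minimal sketch of how the hot threads endpoint could be called while searches are running. The node address `http://localhost:9200` is an assumption, and the helper names `hot_threads_url` and `fetch_hot_threads` are made up for illustration; `threads`, `interval` and `type` are the endpoint's query parameters.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def hot_threads_url(base_url="http://localhost:9200", threads=3,
                    interval="500ms", kind="cpu"):
    """Build the URL for the nodes hot threads API."""
    # base_url is an assumed node address; adjust it to the real cluster.
    params = urlencode({"threads": threads, "interval": interval, "type": kind})
    return f"{base_url.rstrip('/')}/_nodes/hot_threads?{params}"


def fetch_hot_threads(url, timeout=10):
    """Fetch the plain-text hot threads report (needs a reachable node)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Only the URL is printed here, so the sketch runs without a cluster.
print(hot_threads_url())
```

Calling `fetch_hot_threads(hot_threads_url())` a few times while a slow search is in flight should show which threads are busy at that moment.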


(Petter) #4

Thanks for your reply. I didn't know about the hot threads API and will read up on it. I tried a simple text search, which I guess is the most common one; the response time is around 10 seconds the first time but gets better with repeated searches.

Both the query and the result from the /_nodes/hot_threads API can be seen in this Pastebin dump: http://pastebin.com/1xMJhffs

Sadly I don't run 2.2; we are using 1.7.3 right now. Will upgrading in itself give any performance gains?

