Performance optimization (indexing & search) for single node (master thesis)

Hi.

I am currently writing my master thesis on database comparisons, focusing on how to achieve fast search results when dealing with large amounts of data (billions of records).

The reason I chose this topic is that I was quite amazed at the speed of the website 'haveibeenpwned.com' and other similar websites that let you check whether your personal data has been exposed publicly. These sites contain billions of records (up to 10 billion), and response times for queries (including wildcard searches) are generally under 3 seconds. Having worked with databases on other projects, I found this very interesting and wanted to learn more about it. Because of this, and because I had to limit the scope of my master thesis, I decided to look into leaked (breached) data, similar to what these sites index. The data I use is not real, however; it is automatically generated by a Python script, but it has the same structure.

Based on my scope and limitations, I have decided to compare the following database/search-engine systems:

  • MySQL Percona
  • MongoDB Percona
  • Elasticsearch
  • Splunk

The data I am indexing is structured (100.000 entries equal ~ 20 MB per chunk in bulk insertion), and I am having some trouble with slow indexing through the bulk API. This is not a big problem right now, as I can leave my script running overnight for insertions, but if anyone has suggestions on how to speed up the bulk insertions with the Python library, please let me know.
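
For reference, the insertion part of my script looks roughly like this (a simplified sketch using the official elasticsearch Python client; the index name and field names are just placeholders):

```python
# Simplified sketch of my bulk insertion, assuming the official
# elasticsearch-py client; index and field names are placeholders.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def generate_actions(records):
    # One action per document for the bulk helper.
    for record in records:
        yield {
            "_index": "leaked-data",
            "_source": {
                "email": record["email"],
                "username": record["username"],
                "password": record["password"],
            },
        }

def bulk_insert(records, chunk_size=100_000):
    # chunk_size is the number of documents sent per bulk request
    # (one chunk of 100.000 records is roughly 20 MB in my case).
    helpers.bulk(es, generate_actions(records), chunk_size=chunk_size)
```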

During my testing, Elasticsearch has given the best results so far, and I am almost having trouble 'slowing it down' in terms of query response time (using default settings). Because of this, I have to purchase some new hardware for more storage space. Since I am dealing with "cold" data (if I have understood the term correctly), I was wondering whether I can achieve good results with an HDD, or whether I should stick to SSDs.

If I had unlimited resources and money I would buy better hardware, but as a student I cannot afford that at the moment. Additionally, one idea behind my thesis is to find optimization methods that improve the performance of single nodes for personal projects, where one might use an old desktop or similar as a server.

Sorry if this post got a little messy, but I wanted to include all the background information relevant to my question. All help and suggestions are highly appreciated.

Hey,

very happy to see you using Elasticsearch as part of your master thesis!

A couple of annotations here. First, your dataset is rather small. A size of 20 MB will easily fit into the memory of any system you are using, so performance testing might be hard (and will become much more interesting once your data no longer fits into memory; that is also the point where SSDs become crucial for fast responses).

Regarding automatically generated data: systems can behave extremely differently on artificial vs. real-life data (compression of posting lists, for example, is an interesting topic if every name or email address in your data is unique). I understand the need for generated data, but make sure it resembles some real-world cases :slight_smile:
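
A minimal sketch of what I mean, purely as an illustration (drawing names and domains from small pools so that values repeat, instead of making every email unique):

```python
# Illustrative sketch only: generate test records whose values repeat,
# so the index sees realistic term cardinality instead of all-unique values.
import random

FIRST_NAMES = ["anna", "jonas", "maria", "ole", "ingrid", "per"]
DOMAINS = ["gmail.com", "hotmail.com", "yahoo.com", "example.org"]

def fake_record():
    name = random.choice(FIRST_NAMES)
    # Reusing a limited pool of names/domains means many documents share
    # terms, which compresses far better than fully random strings.
    return {
        "username": f"{name}{random.randint(1, 999)}",
        "email": f"{name}.{random.randint(1, 999)}@{random.choice(DOMAINS)}",
        "password": format(random.getrandbits(64), "x"),
    }
```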

Regarding improving the performance of single nodes: I think the biggest gains could come from improving the mapping of your data in order to reduce the amount of data stored on disk or held in memory (which brings me back to the rather small total size).
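
As a rough example of what I mean by tightening the mapping (a sketch only, with made-up field names; whether each option makes sense depends on how you actually query):

```python
# Rough sketch of an explicit mapping that avoids indexing data you never
# search on. Field names are made up; assumes a recent elasticsearch-py
# client (8.x) -- older versions take a single body= dict instead.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="leaked-data",
    settings={
        "number_of_shards": 1,       # single node, no need for more shards
        "number_of_replicas": 0,     # replicas cannot be allocated on one node anyway
    },
    mappings={
        "properties": {
            # keyword instead of text: exact matching, no analysis chain
            "email": {"type": "keyword"},
            "username": {"type": "keyword"},
            # kept in _source but not searchable -> no inverted index for it
            "password": {"type": "keyword", "index": False},
        }
    },
)
```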

Regarding the indexing performance: given the size of the data, indexing should be super fast and done in no time, even on a laptop. Anything above a minute sounds like another issue is going on, to be honest. Maybe you are not actually using the bulk API, or maybe the Python script suffers from another problem. You can use the nodes stats API to find out how much time was spent on the Elasticsearch side of things.
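
For example, something along these lines with the Python client (just a sketch) shows how much time the node itself spent indexing, which you can compare against what your script measures:

```python
# Sketch: read indexing time from the nodes stats API to see how much of
# the wall-clock time was actually spent inside Elasticsearch.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

stats = es.nodes.stats(metric="indices", index_metric="indexing")
for node_id, node in stats["nodes"].items():
    indexing = node["indices"]["indexing"]
    print(node_id,
          "docs indexed:", indexing["index_total"],
          "time spent indexing (ms):", indexing["index_time_in_millis"])
```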

Hope this helps as a starter.

--Alex

Thank you very much for the response.

I noticed a mistake in my original post that I have now corrected; I have a few terabytes of data in total, but the size of one "chunk" that I pass to the bulk API is ~ 20 MB (approximately 100.000 records) right now. I have tried chunk sizes between 1.000 and 100.000 records per bulk insertion to make my script index data faster.
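
In case it is useful, this is roughly how I compare the chunk sizes (a simplified sketch; the time is measured on the client side only, so it includes my script's own overhead):

```python
# Rough sketch of how I time different bulk chunk sizes on the client side.
import time
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def time_chunk_size(records, chunk_size):
    # records must be a fresh list/iterable for each run
    start = time.perf_counter()
    helpers.bulk(es, records, chunk_size=chunk_size)
    elapsed = time.perf_counter() - start
    print(f"chunk_size={chunk_size}: {elapsed:.1f} s")
```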

I'll look into the nodes stats API.
