Hi.
I am currently writing my master's thesis on database comparisons, focusing on how to achieve fast search results when dealing with large amounts of data (billions of records).
I chose this topic because I was quite amazed at the speed of the website 'haveibeenpwned.com' and other similar websites that let you check whether your personal data has been exposed publicly. These sites contain billions of records (up to 10 billion), and the response times for queries (including wildcard searches) are generally under 3 seconds. Having worked with databases myself on other projects, I found this quite interesting and wanted to learn more about it. Because of this, and because I had to limit the scope of my master's thesis, I decided to look into leaked (breached) data similar to what these sites hold. The data I use is not real, however: it is generated automatically by a Python script, but has the same structure.
Based on my scope and limitations, I have decided to compare the following database/search-engine systems:
- MySQL Percona
- MongoDB Percona
- Elasticsearch
- Splunk
The data I am indexing is structured, and I insert it in bulk in chunks of 100,000 entries (roughly 20 MB per chunk). I am having some trouble with slow indexing through the bulk API. This is not a big problem right now, as I can leave my script running overnight for insertions, but if anyone has suggestions on how to speed up bulk insertions with the Python library, please let me know.
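To make the bulk insertion question concrete, here is a minimal sketch of one way such a load could be streamed, assuming the bulk API in question is Elasticsearch's and that the official `elasticsearch` Python client is used with its `helpers.parallel_bulk` function. The field names, index name, URL, thread count and chunk size are all placeholders for illustration, not the actual setup.

```python
from typing import Iterable, Iterator


def generate_records(n: int) -> Iterator[dict]:
    """Yield n synthetic records. The field names are hypothetical."""
    for i in range(n):
        yield {"email": f"user{i}@example.com", "password_hash": f"{i:032x}"}


def to_actions(records: Iterable[dict], index: str) -> Iterator[dict]:
    """Wrap each record in the action dict format that the bulk helpers accept."""
    for rec in records:
        yield {"_index": index, "_source": rec}


def index_all(es_url: str, index: str, n: int, chunk_size: int = 5000) -> int:
    """Stream n records into Elasticsearch from several worker threads."""
    # Imported lazily so the generators above work without the client installed.
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import parallel_bulk

    es = Elasticsearch(es_url)
    actions = to_actions(generate_records(n), index)
    ok_count = 0
    # parallel_bulk is lazy: it only sends requests while it is being iterated.
    for ok, _info in parallel_bulk(es, actions, thread_count=4, chunk_size=chunk_size):
        ok_count += ok
    return ok_count


if __name__ == "__main__":
    # Assumes a local single node; adjust the URL and index name as needed.
    print(index_all("http://localhost:9200", "breach-test", 100_000))
```

Feeding the helper a generator keeps memory use flat regardless of the total record count, and smaller chunks sent from several threads may keep a single node busier than one large 100,000-entry request at a time. Whether that helps in practice depends on the hardware, so the numbers here would need to be tuned by measurement.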
During my testing, Elasticsearch has provided the best results so far, and I am almost having trouble 'slowing it down' in terms of query response time (using default settings). Because of this, I need to purchase some new hardware for more storage space. As I am dealing with "cold" data (if I have understood the term correctly), I was wondering whether I can achieve good results with an HDD, or whether I should stick to SSDs.
If I had unlimited resources and money I would buy better hardware, but as a student I cannot afford this at the moment. Additionally, one idea behind my thesis was to find optimization methods that improve the performance of single nodes for personal projects, where one can use an old desktop or similar as a server.
Sorry if this post got a little bit messy, but I wanted to include all the background information relevant to my question. All help and suggestions are highly appreciated.