Data generation for Elasticsearch

Hello everyone,

I am currently working on a document generator for Elasticsearch, since I missed such functionality in my previous work. The ideal workflow for using the generator looks like this (a rough code sketch follows the list):

  1. Choose the index/type to be scanned -> the tool generates metadata about the values existing in the index
  2. Optionally, modify the metadata to help the tool generate more precise data
  3. Execute document generation, specifying where the data should be sent and how many documents to create (1k, 1M, etc.)
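
To make the workflow concrete, here is a minimal Python sketch of steps 1 and 3, assuming the official elasticsearch-py client (7.x-style calls) and a local cluster; the index name, the "level" field, and the metadata.json format are made up for illustration and are not the tool's actual API.

```python
# Illustration only -- not the tool's actual API. Assumes elasticsearch-py
# (7.x-style calls) and a cluster running on localhost.
import json
import random

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")
index = "my_index"  # hypothetical index name

# Step 1: scan the index and build simple metadata about an existing field.
resp = es.search(index=index, body={
    "size": 0,
    "aggs": {"levels": {"terms": {"field": "level", "size": 50}}},  # hypothetical keyword field
})
metadata = {
    "index": index,
    "fields": {"level": [b["key"] for b in resp["aggregations"]["levels"]["buckets"]]},
}
with open("metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)  # step 2: the user may edit this file by hand

# Step 3: generate documents from the (possibly edited) metadata and send them.
def actions(count):
    for _ in range(count):
        doc = {field: random.choice(values) for field, values in metadata["fields"].items()}
        yield {"_index": index, "_source": doc}

bulk(es, actions(1000))  # e.g. 1k documents; 1M works the same way
```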

I would like to ask the community a few questions:
A) Do you know of any active project doing a similar thing?
B) What features would you like to have in such a generator?

Best regards
Jan

There are a few testing-oriented tools like JMeter and YCSB that may fall into this category, but I don't know of anything specific to ES.

However, it'd be great to see. As basic functionality I'd be looking for something where you could define a payload size and a document count, then generate some random text data with mixed upper and lower case to test different analysis settings.
It'd be great if you could provide a dictionary to use as well.
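
Not speaking for the project, but a minimal sketch of that feature set in Python could look like the following; the word-list file, index name, and field are placeholders, and the bulk helper comes from elasticsearch-py.

```python
# Sketch: generate documents of a target payload size from a user-supplied
# dictionary, randomizing the casing to exercise different analyzers.
import random

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

with open("words.txt") as f:  # user-supplied dictionary, one word per line
    words = [w.strip() for w in f if w.strip()]

def random_text(payload_size):
    """Build a string of roughly payload_size characters with mixed casing."""
    parts, size = [], 0
    while size < payload_size:
        word = random.choice(words)
        word = random.choice([word.lower(), word.upper(), word.capitalize()])
        parts.append(word)
        size += len(word) + 1
    return " ".join(parts)

def actions(doc_count, payload_size):
    for _ in range(doc_count):
        yield {"_index": "test-index", "_source": {"body": random_text(payload_size)}}

es = Elasticsearch("http://localhost:9200")
bulk(es, actions(doc_count=10_000, payload_size=1_024))  # 10k docs of ~1 KB each
```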

Not specific to Elasticsearch, but I really like faker for generating fake data like phone numbers, social security nos, names, and other PII for search testing.
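
For anyone curious, a small Python example of what that looks like with the Faker library (the index name and field choices here are just illustrative):

```python
# Generate PII-like test documents with Faker and index them in bulk.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from faker import Faker

fake = Faker()

def actions(count):
    for _ in range(count):
        yield {
            "_index": "people",  # hypothetical index name
            "_source": {
                "name": fake.name(),
                "phone": fake.phone_number(),
                "ssn": fake.ssn(),
                "address": fake.address(),
            },
        }

es = Elasticsearch("http://localhost:9200")
bulk(es, actions(1000))
```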

Hello Mark,

thanks for pointing me to YCSB, I didn't know such a tool exists. I will definitely look at its capabilities. So far it seems to focus more on executing reads/writes to measure cloud performance; the data are static and not really related to your use case.

The focus of my project is generating data based on an existing index. It comes from my experience: when we designed our data structure, we found out that several customers' data sets were quite different from our expectations. The idea is that you can run the generator to create metadata about a customer's environment, take this info, and recreate the data in your development environment.

Hi Doug,

would you rather use fake data (like faker does) or existing data from your index? Currently the project works by collecting existing tokens from a given field and generating the document field by combining those tokens together (randomization is planned as well).
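
To illustrate the token-combination idea (this is not esBench's implementation, just a Python sketch under assumed names): collect frequent values of a field with a terms aggregation, split them into tokens client-side, and stitch random samples of those tokens back together.

```python
# Sketch of the token-recombination idea, not esBench's actual implementation.
import random

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Collect frequent values of a field and split them into individual tokens.
resp = es.search(index="my_index", body={
    "size": 0,
    "aggs": {"values": {"terms": {"field": "title.keyword", "size": 200}}},  # hypothetical field
})
tokens = set()
for bucket in resp["aggregations"]["values"]["buckets"]:
    tokens.update(bucket["key"].lower().split())
tokens = sorted(tokens)

def generated_title(min_tokens=2, max_tokens=6):
    """Build a new field value by recombining tokens that already exist in the index."""
    k = min(len(tokens), random.randint(min_tokens, max_tokens))
    return " ".join(random.sample(tokens, k))

print(generated_title())
```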

Thanks for your reply and the useful feedback.

Hello everyone,

I have finally got my data generation project to an alpha version and I would like to share it with you. The project's name is esBench and you can find it on GitHub.

Current esBench capabilities:

  • Analyzes Elasticsearch indices and creates metadata about fields and their values
  • Creates documents based on that metadata by combining tokens and values from the metadata file
  • Supports generation for core types, objects, and nested types
  • Performs multi-threaded document insertion into Elasticsearch (see the sketch after this list)
  • Performs synchronized insertion from multiple machines as a Hazelcast cluster
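
As a rough illustration of the multi-threaded insertion step (esBench itself is a Java project; this is only a Python sketch with made-up file and index names, using elasticsearch-py's parallel_bulk helper):

```python
# Sketch of multi-threaded document insertion, not esBench's actual code.
import json
import random

from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

es = Elasticsearch("http://localhost:9200")

with open("metadata.json") as f:  # hypothetical metadata file from the analysis step
    metadata = json.load(f)

def actions(count):
    for _ in range(count):
        doc = {field: random.choice(values) for field, values in metadata["fields"].items()}
        yield {"_index": metadata["index"], "_source": doc}

# Index 100k generated documents using 8 worker threads.
for ok, result in parallel_bulk(es, actions(100_000), thread_count=8, chunk_size=1_000):
    if not ok:
        print("failed:", result)
```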

At this stage esBench lacks a proper output display and the documentation is still in progress; however, the project is ready for early adopters who can influence the API and features of future releases.
In any case, I would really appreciate any feedback (negative feedback is very welcome, there is always something to improve).

Thanks in advance
Jan

I am more than happy to announce that version 0.0.2 of esBench has been released.

More info at: https://github.com/kucera-jan-cz/esBench

Best regards
Jan

mdm-gen.js (https://github.com/milindparikh/mdm-gen) produces functional data suitable for ingesting into Elasticsearch through Logstash, including nested data.