Data generation for Elasticsearch

Hello everyone,

I am currently working on a document generator for Elasticsearch, since I missed such functionality in my previous work. The ideal workflow for using the generator looks like this (a rough code sketch follows the list):

  1. Choose the index/type to be scanned -> the tool generates metadata about the values existing in the index
  2. Optionally, modify the metadata to help the tool generate more precise data
  3. Execute document generation, specifying where the data should be sent and how many documents to create (1k, 1M, etc.)
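
To make the workflow concrete, here is a minimal Python sketch of steps 1 and 3, assuming the official elasticsearch-py client (7.x-style calls) and a local cluster; the index name, the "level" field, and the metadata.json format are made up for illustration and are not the tool's actual API.

```python
# Illustration only -- not the tool's actual API. Assumes elasticsearch-py
# (7.x-style calls) and a cluster running on localhost.
import json
import random

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")
index = "my_index"  # hypothetical index name

# Step 1: scan the index and build simple metadata about an existing field.
resp = es.search(index=index, body={
    "size": 0,
    "aggs": {"levels": {"terms": {"field": "level", "size": 50}}},  # hypothetical keyword field
})
metadata = {
    "index": index,
    "fields": {"level": [b["key"] for b in resp["aggregations"]["levels"]["buckets"]]},
}
with open("metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)  # step 2: the user may edit this file by hand

# Step 3: generate documents from the (possibly edited) metadata and send them.
def actions(count):
    for _ in range(count):
        doc = {field: random.choice(values) for field, values in metadata["fields"].items()}
        yield {"_index": index, "_source": doc}

bulk(es, actions(1000))  # e.g. 1k documents; 1M works the same way
```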

I would like to ask the community a few questions:
A) Do you know of any active project doing a similar thing?
B) What features would you like to have in such a generator?

Best regards
Jan

There are a few testing-oriented tools like JMeter and YCSB that may fall into this category, but I don't know of anything specific to ES.

However, it'd be great to see. As basic functionality I'd be looking for something where you could define a payload size and a document count, then generate some random text data with mixed upper and lower case to test different analysis settings.
It'd be great if you could provide a dictionary to use as well.
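
Not speaking for the project, but a minimal sketch of that feature set in Python could look like the following; the word-list file, index name, and field are placeholders, and the bulk helper comes from elasticsearch-py.

```python
# Sketch: generate documents of a target payload size from a user-supplied
# dictionary, randomizing the casing to exercise different analyzers.
import random

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

with open("words.txt") as f:  # user-supplied dictionary, one word per line
    words = [w.strip() for w in f if w.strip()]

def random_text(payload_size):
    """Build a string of roughly payload_size characters with mixed casing."""
    parts, size = [], 0
    while size < payload_size:
        word = random.choice(words)
        word = random.choice([word.lower(), word.upper(), word.capitalize()])
        parts.append(word)
        size += len(word) + 1
    return " ".join(parts)

def actions(doc_count, payload_size):
    for _ in range(doc_count):
        yield {"_index": "test-index", "_source": {"body": random_text(payload_size)}}

es = Elasticsearch("http://localhost:9200")
bulk(es, actions(doc_count=10_000, payload_size=1_024))  # 10k docs of ~1 KB each
```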

Not specific to Elasticsearch, but I really like faker for generating fake data like phone numbers, social security nos, names, and other PII for search testing.
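
For anyone curious, a small Python example of what that looks like with the Faker library (the index name and field choices here are just illustrative):

```python
# Generate PII-like test documents with Faker and index them in bulk.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from faker import Faker

fake = Faker()

def actions(count):
    for _ in range(count):
        yield {
            "_index": "people",  # hypothetical index name
            "_source": {
                "name": fake.name(),
                "phone": fake.phone_number(),
                "ssn": fake.ssn(),
                "address": fake.address(),
            },
        }

es = Elasticsearch("http://localhost:9200")
bulk(es, actions(1000))
```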

Hello Mark,

thanks for pointing me to YCSB, I didn't know such a tool exists. I will definitely look at its capabilities. So far it seems to focus more on executing reads/writes to measure cloud performance; the data are static and not really related to your use case.

The focus of my project is generating data based on an existing index. It comes from my experience: when we designed our data structure, we found out that several customers' data sets were quite different from our expectations. The idea is that you can run the generator to create metadata about a customer's environment, take this info, and recreate the data in your development environment.

Hi Doug,

would you rather use fake data (like faker does) or existing data from your index? Currently the project works by collecting existing tokens from a given field and generating the document field by combining those tokens together (randomization is planned as well).
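
To illustrate the token-combination idea (this is not esBench's implementation, just a Python sketch under assumed names): collect frequent values of a field with a terms aggregation, split them into tokens client-side, and stitch random samples of those tokens back together.

```python
# Sketch of the token-recombination idea, not esBench's actual implementation.
import random

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Collect frequent values of a field and split them into individual tokens.
resp = es.search(index="my_index", body={
    "size": 0,
    "aggs": {"values": {"terms": {"field": "title.keyword", "size": 200}}},  # hypothetical field
})
tokens = set()
for bucket in resp["aggregations"]["values"]["buckets"]:
    tokens.update(bucket["key"].lower().split())
tokens = sorted(tokens)

def generated_title(min_tokens=2, max_tokens=6):
    """Build a new field value by recombining tokens that already exist in the index."""
    k = min(len(tokens), random.randint(min_tokens, max_tokens))
    return " ".join(random.sample(tokens, k))

print(generated_title())
```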

Thanks for your reply and the useful feedback.

Hello everyone,

I have finally got my data generation project to an alpha version and I would like to share it with you. The project's name is esBench and you can find it on GitHub.

Current esBench capabilities:

  • Analyzes Elasticsearch indices and creates metadata about fields and their values
  • Creates documents based on that metadata by combining tokens and values from the metadata file
  • Supports generation for core types, objects, and nested types
  • Performs multi-threaded document insertion into Elasticsearch (see the sketch after this list)
  • Performs synchronized insertion from multiple machines as a Hazelcast cluster
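
As a rough illustration of the multi-threaded insertion step (esBench itself is a Java project; this is only a Python sketch with made-up file and index names, using elasticsearch-py's parallel_bulk helper):

```python
# Sketch of multi-threaded document insertion, not esBench's actual code.
import json
import random

from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

es = Elasticsearch("http://localhost:9200")

with open("metadata.json") as f:  # hypothetical metadata file from the analysis step
    metadata = json.load(f)

def actions(count):
    for _ in range(count):
        doc = {field: random.choice(values) for field, values in metadata["fields"].items()}
        yield {"_index": metadata["index"], "_source": doc}

# Index 100k generated documents using 8 worker threads.
for ok, result in parallel_bulk(es, actions(100_000), thread_count=8, chunk_size=1_000):
    if not ok:
        print("failed:", result)
```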

At this stage esBench lacks a proper output display and the documentation is still in progress; however, the project is ready for early adopters who can influence the API and features of future releases.
In any case, I would really appreciate any feedback (negative feedback is very welcome, there is always something to improve).

Thanks in advance
Jan

I am more than happy to announce that version 0.0.2 of esBench has been released.

More info at: https://github.com/kucera-jan-cz/esBench

Best regards
Jan

mdm-gen.js (https://github.com/milindparikh/mdm-gen) produces functional data suitable for ingesting into Elasticsearch through Logstash, including nested data.