Avoiding mapping explosion and structuring JSON

XOR · May 25, 2017, 1:23am

Hello,

I've been looking for an answer for a couple days and hopefully someone here can help.

I am new to Elasticsearch and love the concept and function. I am going to use it for a project that will be collecting some system status information for analysis.

Most of the incoming JSON will have fairly fixed keys in the header. The results data, however, will vary a lot. I ran into my first problem when I started sending in test data to try out the mapping: I got an indexing error once the index hit 1000 fields.

I had a mapping explosion. I know what went wrong. What I'm looking to do is re-structure the incoming data to avoid this situation.

The problem is outlined below. Basically I have a results key, and under that key I return JSON whose keys vary. So system_logs, ... will be different from one document to the next.

In one case, I returned system PIDs and the path to the process for another kind of metric ("pid":"process_path"). e.g.

"444":"/sbin/system_daemon",
"6583":"/usr/local/bin/user_daemon_1",
"14567":"/usr/local/bin/user_daemon_2"

etc.

That gave me a lot of PIDs as keys, and of course on a Unix host you can have 65k+ of them, so the index mapping was growing way too large. It also makes it impossible to define a mapping up front that covers all of these possibilities.

What I'm looking to do is come up with a clean way to make the results available without losing the ability to have unique key: value pairs.

I know it is not really practical with Elasticsearch to have unique keys constantly flowing in. So I'm looking for a workaround that gives me some way to flatten this kind of result so it is more Elasticsearch-friendly.
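One common way to flatten a map like this, sketched here under assumptions, is to move each varying key into a value under a fixed field name, so the mapping only ever sees the same two fields no matter how many PIDs show up. The helper name `flatten_pid_map` and the field names `pid` and `path` are hypothetical, chosen only for illustration.

```python
def flatten_pid_map(pid_map):
    """Turn {"<pid>": "<process_path>"} into a list of fixed-shape objects.

    Hypothetical helper: the field names "pid" and "path" are assumptions,
    not something the original data defines.
    """
    return [
        {"pid": int(pid), "path": path}
        for pid, path in sorted(pid_map.items(), key=lambda kv: int(kv[0]))
    ]


raw = {
    "444": "/sbin/system_daemon",
    "6583": "/usr/local/bin/user_daemon_1",
    "14567": "/usr/local/bin/user_daemon_2",
}

# Every entry now has the same two keys, regardless of which PIDs exist.
processes = flatten_pid_map(raw)
```

Note that Elasticsearch flattens arrays of objects by default, so a query that needs to match a specific `pid` together with its `path` would likely need the field mapped as `nested`, or each process indexed as its own document.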

A sample JSON is below to show what is coming in now. Any recommendations on how I can change the input JSON for efficiency and searchability are appreciated. Keep in mind that under the "logs" heading the list of log files can be large and ever-changing, which is not going to work well the way I have it now.

Thank you for all the work on Elastic.

{
    "header": {
        "status": "ok",
        "status_msg": "ok",
        "ip_addr": "192.168.1.1"
    },
    "data": {
        "status": "ok",
        "status_msg": "ok",
        "results": {
            "logs": {
                "system_logs": [
                    "/log/syslog"
                ],
                "www_logs": [
                    "/log/www/www_log",
                    "/log/www/www_log.2",
                    "/log/www/www_log.3",
                    "/log/www/www_log.4"
                ]
            }
        },
        "name": "log_grabber",
        "start_time": "2017-05-24T21:40:18.455588Z",
        "end_time": "2017-05-24T21:40:18.481798Z"
    }
}
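For the "logs" section in the sample above, one possible reshaping is to turn each varying log category into a value rather than a key. This is only a sketch: the function name `flatten_logs` and the field names `category` and `path` are hypothetical, not part of the original data.

```python
def flatten_logs(logs):
    """Turn {"<category>": ["<path>", ...]} into a flat list of entries.

    Hypothetical helper: "category" and "path" are assumed field names.
    The mapping then holds two fields however many categories appear.
    """
    entries = []
    for category, paths in logs.items():
        for path in paths:
            entries.append({"category": category, "path": path})
    return entries


logs = {
    "system_logs": ["/log/syslog"],
    "www_logs": [
        "/log/www/www_log",
        "/log/www/www_log.2",
        "/log/www/www_log.3",
        "/log/www/www_log.4",
    ],
}

# One fixed-shape object per log file instead of one key per category.
log_entries = flatten_logs(logs)
```

With this shape, a new log category adds a new value to `category` rather than a new field to the mapping.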

ahmadimt07 · May 25, 2017, 3:47am

Can you please share the error you got?

Also, you can use Filebeat and filter some of the data.

XOR · May 25, 2017, 4:06am

I can't use Filebeat. The system is not doing just log collection, but a variety of other tasks. Logs are just one thing it may do, and it's a minor task.

The error I initially got was about exceeding a limit on the index (I don't recall the exact message). The way to resolve it was to raise that limit, which I don't want to do because it just papers over the real problem.

The problem I have revolves around how the JSON data is coming in and how I should reformat it so that it does not overrun Elasticsearch with new keys every time the values change. What I'm fishing for are ideas on how I could present the kind of data above in a way that is friendlier to Elasticsearch. Basically: what changes can I make to the JSON for the results data so that, if I have a bunch of keys that vary wildly over time, I can still search for them and get the parameters associated with them?

I don't know if that makes sense, but that's the problem. I have some ideas on how to resolve it by having a fixed list of acceptable results keys and ensuring the API only uses them. That way I can ensure the system does not go outside the safe zone and start using keys randomly. But if there is a simpler or better way to do it, I'm open to any and all ideas.
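The fixed-list idea could be sketched roughly as follows, as a check that runs before a document is sent for indexing. Everything named here is an assumption for illustration: the set `ALLOWED_RESULT_KEYS`, its contents, and the function `check_result_keys` do not come from the original system.

```python
# Hypothetical whitelist of top-level keys permitted under "results".
ALLOWED_RESULT_KEYS = frozenset({"logs", "processes"})


def check_result_keys(results, allowed=ALLOWED_RESULT_KEYS):
    """Return results unchanged, or raise if it uses a key outside the list."""
    unexpected = sorted(set(results) - allowed)
    if unexpected:
        raise ValueError(f"unexpected results keys: {unexpected}")
    return results


# A document that stays inside the safe zone passes through untouched.
ok = check_result_keys({"logs": [{"category": "system_logs", "path": "/log/syslog"}]})
```

Elasticsearch can also enforce a similar rule on its side: setting `"dynamic": "strict"` in a mapping makes it reject documents that contain unmapped fields, which may complement a client-side check like this one.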

system · June 22, 2017, 4:06am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
