Avoiding mapping explosion and structuring JSON


(XOR) #1

Hello,

I've been looking for an answer for a couple days and hopefully someone here can help.

I am new to Elasticsearch and love the concept and function. I am going to use it for a project that will be collecting some system status information for analysis.

Most of the incoming JSON will have fairly fixed keys for the headers. The results data, however, will vary widely. I ran into my first problem when I started sending in test data to try out mapping: I got an indexing error when I hit 1000 fields.

I had a mapping explosion. I know what went wrong. What I'm looking to do is re-structure the incoming data to avoid this situation.

The problem is outlined below. Basically I have a results key, and under that key I return JSON that has varying keys. So system_logs, ... will be different.

In one case, I returned system PIDs and the path to the process for another kind of metric ("pid":"process_path"). e.g.

"444":"/sbin/system_daemon",
"6583":"/usr/local/bin/user_daemon_1",
"14567":"/usr/local/bin/user_daemon_2"

etc.

That produced a lot of PIDs as keys, and of course a Unix host can have 65k+ of them, so the index was growing way too large. It also makes it impractical to define a mapping up front that covers all of these possibilities.
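
To make the growth concrete, here is a rough Python sketch that counts the distinct dotted field paths across two documents. It is only a proxy for what dynamic mapping would create (Elasticsearch also counts object fields toward its limit), and the helper name `leaf_paths` is hypothetical.

```python
from typing import Any, Set


def leaf_paths(doc: Any, prefix: str = "") -> Set[str]:
    """Collect the dotted paths of all leaf values; lists add no path level."""
    if isinstance(doc, dict):
        paths: Set[str] = set()
        for key, value in doc.items():
            child = f"{prefix}.{key}" if prefix else key
            paths |= leaf_paths(value, child)
        return paths
    if isinstance(doc, list):
        paths = set()
        for item in doc:
            paths |= leaf_paths(item, prefix)
        return paths
    return {prefix}


# Two documents that use PIDs as keys: every new PID is a new field path.
doc_a = {"results": {"444": "/sbin/system_daemon"}}
doc_b = {"results": {"6583": "/usr/local/bin/user_daemon_1"}}

seen = leaf_paths(doc_a) | leaf_paths(doc_b)
print(sorted(seen))  # ['results.444', 'results.6583']
```

Each document with a previously unseen PID adds another path, so the set grows with the number of distinct PIDs rather than with the number of kinds of data.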

What I'm looking to do is come up with a clean way to make the results available without losing the ability to have unique key: value pairs.

I know it is not really practical with Elasticsearch to have unique keys constantly flowing in. So I'm looking for a workaround that will give me some way to flatten this kind of result so it is more Elasticsearch-friendly.
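
One common way to flatten this kind of result is to move the varying key into a value, so every entry has the same two field names. The Python sketch below assumes that approach; the function `pairs_to_list` and the field names `pid` and `path` are made up for illustration.

```python
from typing import Dict, List


def pairs_to_list(
    pairs: Dict[str, str],
    key_name: str = "key",
    value_name: str = "value",
) -> List[Dict[str, str]]:
    """Turn {"444": "/sbin/x"} into [{"key": "444", "value": "/sbin/x"}]."""
    return [{key_name: k, value_name: v} for k, v in pairs.items()]


# The PID example from above, with the PIDs as keys.
processes = {
    "444": "/sbin/system_daemon",
    "6583": "/usr/local/bin/user_daemon_1",
    "14567": "/usr/local/bin/user_daemon_2",
}

# Hypothetical field names: "pid" and "path".
flattened = pairs_to_list(processes, key_name="pid", value_name="path")
print(flattened[0])  # {'pid': '444', 'path': '/sbin/system_daemon'}
```

With this shape the mapping needs only two fields no matter how many PIDs show up. Note that Elasticsearch flattens arrays of objects by default, so a query that must match a specific pid together with its path on the same entry would likely need the array mapped as the `nested` type.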

A sample JSON is below to show what is coming in now. Any recommendations on how I can change the input JSON for efficiency and searchability are appreciated. Keep in mind that under the "logs" heading the list of log files can be large and ever changing, which is not going to work well the way I have it now.

Thank you for all the work on Elastic.

{
    "header": {
        "status": "ok",
        "status_msg": "ok",
        "ip_addr": "192.168.1.1"
    },
    "data": {
        "status": "ok",
        "status_msg": "ok",
        "results": {
            "logs": {
                "system_logs": [
                    "/log/syslog"
                ],
                "www_logs": [
                    "/log/www/www_log",
                    "/log/www/www_log.2",
                    "/log/www/www_log.3",
                    "/log/www/www_log.4"
                ]
            }
        },
        "name": "log_grabber",
        "start_time": "2017-05-24T21:40:18.455588Z",
        "end_time": "2017-05-24T21:40:18.481798Z"
    }
}
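
For the "logs" part of a document like this, one possible restructuring is to turn each log group name into a value instead of a key. This is a minimal Python sketch under that assumption; `flatten_logs` and the field names `category` and `path` are hypothetical, not part of any existing API.

```python
from typing import Dict, List


def flatten_logs(logs: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Emit one {"category", "path"} entry per log file."""
    return [
        {"category": category, "path": path}
        for category, paths in logs.items()
        for path in paths
    ]


# A shortened copy of the "logs" object from the sample document.
logs = {
    "system_logs": ["/log/syslog"],
    "www_logs": ["/log/www/www_log", "/log/www/www_log.2"],
}

entries = flatten_logs(logs)
print(entries[0])  # {'category': 'system_logs', 'path': '/log/syslog'}
```

New log groups then appear as new values of `category` rather than as new fields, so the list of groups can change without touching the mapping.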

(Imteyaz Ahmad) #2

Can you please share the error you got?

Also, you can use Filebeat and filter some of the data.


(XOR) #3

I can't use Filebeat. The system is not just doing log collection, but a variety of other tasks. Logs are just one thing it may do, and it's a minor task.

The error I initially had was about exceeding the index's field limit (I don't recall the exact error). Basically the way to resolve it was to increase that limit on the index, which I don't want to do because it just papers over the real problem.

The problem I have revolves around how the JSON data is coming in and how I should re-format it so it does not overrun Elasticsearch with a bunch of new keys every time the values change. What I'm fishing for are some ideas on how I could present the kind of data above in a way that is friendlier to Elasticsearch. Basically, what changes can I make to the JSON for the results data so that, if I have a bunch of keys that vary wildly over time, I can still search for them and get the parameters associated with them?

I don't know if that makes sense, but that's the problem. I have some ideas on how to resolve it in terms of having a fixed list of values that are acceptable for results keys and ensuring the API only uses them. That way I can ensure the system does not go outside the safe zone and start using keys randomly. But if there was a simpler or better way to do it I'm open to any and all ideas.
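
The fixed-list idea could be sketched roughly as follows in Python, checking a document before it is sent for indexing. The allowed key names and the function `unexpected_result_keys` are hypothetical, and the document layout is assumed to match the sample above (`data.results`).

```python
from typing import Any, Dict, List

# Hypothetical allowlist of keys permitted directly under data.results.
ALLOWED_RESULT_KEYS = frozenset({"logs", "processes"})


def unexpected_result_keys(document: Dict[str, Any]) -> List[str]:
    """Return the keys under data.results that are not on the allowlist."""
    results = document.get("data", {}).get("results", {})
    return sorted(key for key in results if key not in ALLOWED_RESULT_KEYS)


good = {"data": {"results": {"logs": {"system_logs": ["/log/syslog"]}}}}
bad = {"data": {"results": {"logs": {}, "444": "/sbin/system_daemon"}}}

print(unexpected_result_keys(good))  # []
print(unexpected_result_keys(bad))   # ['444']
```

This only checks the first level under results; nested keys would need the same treatment. On the Elasticsearch side, setting `"dynamic": "strict"` in the mapping is a related safeguard, since it rejects documents that contain unmapped fields instead of silently adding them.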

