Porting a relational structure to Elasticsearch - Nested or Parent/Child


(Brian Gow) #1

I'm new to Elasticsearch but have done enough research to figure out how to map, enter and perform basic "relational" searches using Nested and Parent/Child structures. I believe my data is better suited to Parent/Child structures but I also desire support for aggregation in Kibana. My understanding is that in v4.4 this will be supported for Nested structures and the files are available now (https://github.com/elastic/kibana/pull/4806).

I have a lot of detailed questions but for simplicity I think it might be better if I simply give an example of my data structure and ask for the approach you would take to getting what I desire from Elasticsearch (ES) and Kibana.

My data is contained in large CSV files (~1MB/file to 30+GB/file), related to each other by an ID variable. A very simplified example of the structure for 3 CSV files is:

File1:

ID,AGE,GENDER,WAVE
1,49,M,1
2,72,F,0

File2:

ID,TIME,EVENT1
1,4/20/2095,V39
1,4/21/2095,T21
2,5/17/2094,V39
2,5/18/2094,R9
2,5/20/2094,Q20

File3:

ID,TIME,EVENT2
1,4/22/2095,P90
1,4/23/2095,T90
1,4/23/2095,E2
1,4/24/2095,RR2
2,5/18/2094,E2
2,5/18/2094,W3

A simple Elasticsearch I'd like to be able to perform is: return any ID who has WAVE=1 and EVENT1=V39 and EVENT2=E2. I don't think this will work for a purely flat structure since I wouldn't be able to express that WAVE=1, EVENT1=V39 and EVENT2=E2 must be true for the same ID. Therefore I need to use either Nested or Parent/Child structures.

A simple Kibana plot could be a pie chart of GENDER, aggregated by EVENT1. I know that flattening out my structure (so all variables (EVENT1, EVENT2, etc) are in the same CSV) would allow for this but if you imagine the above stucture but with 10,000+ ID's, each with 100's of rows of events (plus many more EVENTX variables) you realize it would be difficult to manage.

Given this example and what I'd like to get from the data do you recommend using Nested or Parent/Child structures? If I used Parent/Child structures could I get aggregated plots for variables across the CSV's (i.e. across the parents/children)? If I use Nested structures is there a good way to import the data or will I need to flatten my CSV's first and then assign the nested variables?


(Mark Harwood) #2

Rough decision process:


Which index(element) of an array are the ones that matched in highlight search results?
(Ziv) #3

That's a good decision flow :stuck_out_tongue:
Do you have ram limits?
parent/child joins are held in memory so you have to calculate or estimate the amount of relationships.

Nested documents are a bit of a pain, you'd have to make some groovy scripts to update or do application side updates to the entire document.
Also considering the amount of events you have, there is a max of 2B documents a shard can hold ( including nested).

How about flattening it this way:
Age, Gender, Wave, Time, EVENT1, EVENT2 etc.. dynamically added to the type.
ie:
{ age: 49,
gender: M,
wave: 0,
timestamp: 4/20/2095,
event1: V39
}
each event will be created this way.
If you don't have storage limits this could be easy to manage.
Also elastic search has some pretty nice compression for repeating values..


Logstash mutate filter, rename - "Exception in filterworker", "exception"=>#<IndexError: string not matched>
(Brian Gow) #4

Thanks for the replies!

Regarding the flowchart, whether we could "sensibly" fit all of our related items into a single JSON file is debatable. However if we could automate (via logstash, etc) the import of our data into a mapping where each file/spreadsheet was it's own Nested object I think it would be manageable in the end. For the example I give above a mapping which seems to be working (allowing relational searches) is:

curl -XPOST "http://localhost:9200/forum/" -d'
{
    "mappings" : {
        "subject" : {
            "properties" : {
                "id" : {"type" : "string"},
				"file1" : {
					"type" : "nested",
					"include_in_parent" : false,
                    "properties" : {
						"age" : {"type" : "string", "index":"not_analyzed"},
                        "gender" : {"type" : "string", "index":"not_analyzed"},
                        "wave" : {"type" : "string", "index":"not_analyzed"}
						}
					},
                "file2" : {
                    "type" : "nested",
                    "include_in_parent" : false,
                    "properties" : {
                        "time" : {"type" : "string", "index":"not_analyzed"},
                        "event1" : {"type" : "string", "index":"not_analyzed"}
						}
					},
				"file3" : {
                    "type" : "nested",
                    "include_in_parent" : false,
                    "properties" : {
                        "time" : {"type" : "string", "index":"not_analyzed"},
                        "event2" : {"type" : "string", "index":"not_analyzed"}
						}
					}
                    }
                }
            },
    "settings" : {
        "number_of_shards" : 1,
        "number_of_replicas" : 0
    }
}'

I'm now trying to figure out how to get logstash to import my data with a structure like this. However when I try to use mutate to create a nested field I'm getting an "Exception in filterworker", "exception"=>#<IndexError: string not matched> error.

filter {

		mutate { 
			rename => [ "AGE", "[ID][AGE]" ]
		} 
}

The verbose exception is:

{:timestamp=>"2015-10-26T20:15:34.837000-0400", :message=>"Exception in filterworker", "exception"=>#<IndexError: string not matched>, "backtrace"=>["org/jruby/RubyString.java:3912:in `[]='", "C:/users/bgow/documents/MIMIC/III/ElasticSearch/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.4-java/lib/logstash/util/accessors.rb:64:in `set'", "C:/users/bgow/documents/MIMIC/III/ElasticSearch/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.4-java/lib/logstash/event.rb:146:in `[]='", "C:/users/bgow/documents/MIMIC/III/ElasticSearch/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-filter-mutate-1.0.1/lib/logstash/filters/mutate.rb:240:in `rename'", "org/jruby/RubyHash.java:1341:in `each'", "C:/users/bgow/documents/MIMIC/III/ElasticSearch/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-filter-mutate-1.0.1/lib/logstash/filters/mutate.rb:238:in `rename'", "C:/users/bgow/documents/MIMIC/III/ElasticSearch/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-filter-mutate-1.0.1/lib/logstash/filters/mutate.rb:211:in `filter'", "C:/users/bgow/documents/MIMIC/III/ElasticSearch/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.4-java/lib/logstash/filters/base.rb:163:in `multi_filter'", "org/jruby/RubyArray.java:1613:in `each'", "C:/users/bgow/documents/MIMIC/III/ElasticSearch/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.4-java/lib/logstash/filters/base.rb:160:in `multi_filter'", "(eval):71:in `filter_func'", "C:/users/bgow/documents/MIMIC/III/ElasticSearch/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.4-java/lib/logstash/pipeline.rb:219:in `filterworker'", "C:/users/bgow/documents/MIMIC/III/ElasticSearch/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.4-java/lib/logstash/pipeline.rb:157:in `start_filters'"], :level=>:error, :file=>"/users/bgow/documents/MIMIC/III/ElasticSearch/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.4-java/lib/logstash/pipeline.rb", :line=>"231", :method=>"filterworker"}

This appears to be similar to this bug. However in the bug they clearly state that this should only occur for fields containing numeric values but I'm using all strings. I even added characters in front of all numbers to make sure the import wasn't confusing them as numeric. Do you think this is related to the bug or am I doing something wrong?


(Brian Gow) #5

Do you have ram limits?
parent/child joins are held in memory so you have to calculate or estimate the amount of relationships.

Thanks that is good to know.

How about flattening it this way:
Age, Gender, Wave, Time, EVENT1, EVENT2 etc.. dynamically added to the type.
ie:
{ age: 49,
gender: M,
wave: 0,
timestamp: 4/20/2095,
event1: V39
}

I had been avoiding this approach because I'd like to be able to distinguish between the rows in my example for a given ID. In other words I'd like to be able to do a search which returns any ID's who have EVENT1=V39 and TIME=04/20/2095. Above this should return ID 1 but the issue is if we instead search for EVENT1=V39 and TIME=04/21/2095, it will still return ID 1 even though the time is actually for a different event of theirs. Is this how it works? This is why I thought I had to use Nested or Parent/Child.


(Brian Gow) #6

Since I can't seem to come up with a resolution for using Logstash to create nested structures, I'm now instead looking into creating JSON files from my CSV's and then importing them directly into Elasticsearch (ES).

I'm having an issue getting the desired result when inputting the information by file though. For a simplified subset of the data above when I import it into ES by ID; manually assigning the ID's as documents:

curl -XPUT "http://localhost:9200/forum_ms/subject/1" -d'
{
    "id" : "1",
    "file1" : [
        {
			"age" : "49",
            "gender" : "M",
            "wave" : "1"
        }
    ],
	"file2" : [
		{
			"time" : "04/20/2095",
			"event1" : "V39"
		},
		{
			"time" : "04/21/2095",
			"event1" : "T21"
		}
	]
}'

curl -XPUT "http://localhost:9200/forum_ms/subject/2" -d'
{
    "id" : "2",
    "file1" : [
        {
			"age" : "72",
            "gender" : "F",
            "wave" : "0"
        }
    ],
	"file2" : [
		{
			"time" : "05/17/2094",
			"event1" : "V39"
		}
	]
}'

I get the correct nested structure:

curl -XGET "http://localhost:9200/forum_ms/subject/_search?pretty"

{
            "_source": {
               "id": "1",
               "file1": [
                  {
                     "age": "49",
                     "gender": "M",
                     "wave": "1"
                  }
               ],
               "file2": [
                  {
                     "time": "04/20/2095",
                     "event1": "V39"
                  },
                  {
                     "time": "04/21/2095",
                     "event1": "T21"
                  }
               ]
            }
         },
         {
            "_index": "forum_ms",
            "_type": "subject",
            "_id": "2",
            "_score": 1,
            "_source": {
               "id": "2",
               "file1": [
                  {
                     "age": "72",
                     "gender": "F",
                     "wave": "0"
                  }
               ],
               "file2": [
                  {
                     "time": "05/17/2094",
                     "event1": "V39"
                  }
               ]
            }
         }
      ]
   }
}

However due to the number of subjects and files I need to import my files individually and have it automatically assign the documents to the correct ID. I've tried this but it doesn't produce the same structure - the documents aren't all grouped under a single ID:

curl -XPUT "http://localhost:9200/forum_ms/subject/_id" -d'
{
    "id" : "1",
	    "file1" : [
        {
			"age" : "49",
            "gender" : "M",
            "wave" : "1"
        }
    ],	
    "id" : "2",
	    "file1" : [
        {
			"age" : "72",
            "gender" : "F",
            "wave" : "0"
        }
    ]
}’

curl -XPUT "http://localhost:9200/forum_ms/subject/_update" -d'
{
    "id" : "1",
		"file2" : [
		{
			"time" : "04/20/2095",
			"event1" : "V39"
		},
		{
			"time" : "04/21/2095",
			"event1" : "T21"
		}
	],
    "id" : "2",
		"file2" : [
		{
			"time" : "05/17/2094",
			"event1" : "V39"
		}
	]
}'

Which gives this:

curl -XGET "http://localhost:9200/forum_ms/subject/_search?pretty"


      "_source":{
    "id" : "1",
        "file1" : [
        {
			"age" : "49",
            "gender" : "M",
            "wave" : "1"
        }
    ],	
    "id" : "2",
	    "file1" : [
        {
			"age" : "72",
            "gender" : "F",
            "wave" : "0"
        }
    ]
}’

    }, {
      "_index" : "forum_ms",
      "_type" : "subject",
      "_id" : "_update",
      "_score" : 1.0,
      "_source":{
    "id" : "1",
    	"file2" : [
		{
			"time" : "04/20/2095",
			"event1" : "V39"
		},
		{
			"time" : "04/21/2095",
			"event1" : "T21"
		}
	],
    "id" : "2",
		"file2" : [
		{
			"time" : "05/17/2094",
			"event1" : "V39"
		}
	]
}

Could you please help me figure out what I need to modify in this second case so that I get the desired result of all documents under a given ID (i.e. no repeated ID objects).


(Ziv) #7

Oh I missed the ID column with my example.
So an object should be of this sort:
{ id:1
age: 49,
gender: M,
wave: 0,
timestamp: 4/20/2095,
event1: V39
}
another one:
{ id:1
age: 49,
gender: M,
wave: 0,
timestamp: 4/21/2095,
event2: V40
}

Don't think of it as rows. Try thinking of it as all documents or json objects which have a certain fields.
For example
so to cut all events for a certain ID, is easy by just using term query on id=1 for example.
cutting by ID and time stamp, for all events on that timestamp, just use bool query of must term id=1 and term timestamp = 4/20/2095.

getting all IDs by timestamp, easy as well, just using term timestamp = 4/20/2095.
getting all IDS with event1, regardless of value, use exists filter on event1 field.
with value, just use event1=X.

This will return only the document(s) which have those properties. not different types.


(Brian Gow) #8

Thank you for the thoughts!

I'm probably not clearly explaining all of my requirements and I don't understand the Elasticsearch "query" and "filtering" as well as I'd like.

However with the structure you suggested I don't think I'd be able to identify a given ID who has particular events across JSON objects/documents. For example if all the example data was entered this way (each event1 row gets its own document and each event2 row gets it own document) and I wanted to search for all ID's who have event1 = V39. This should return ID:1 and 2. If I additionally wanted to know which ID's have event2 = P90 along with event1 = V39, could I write a filter search for this? I'd need some way to pass all of the ID's from the event1 = V39 to the event2 = E2 such that it only queried those ID's. Does this make sense? Could I do this without using either Nested or Parent/Child structures?

If I do end up using Nested structures could you recommend a method for getting this example data from CSV's into Elasticsearch (via Logstash, or first converting to JSON) which would also work for larger datasets (lots of rows - so it would need to be automated).


(Ziv) #9

In my way they are all together in one big happy json per event that happens. The only down side is that it uses more space for storage.
I refrain from answering your log stash questions as I've never used it, but I'll try explaining again this denormalised approach with elastic search.

First you have to understand that elastic search has dynamic types and fields, which means you don't have define your types prior to data insertion.
lets say you only have two types with mapping such as this:
mappings" : {
"my_event" : {
"properties" : {
"id" : {"type" : "string"},
"timestamp" : {"type" : "date"},
}
}

mappings" : {
"my_profile" : {
"properties" : {
"id" : {"type" : "string"},
"age" : {"type" : "integer"},
"gender" : {"type" : "string"},
"wave" : {"type" : "integer"},
}
}

so my event type is meatless, as it only holds ID and TIME.
however, when you insert a json with the type of my_event, and add a field called EVENT1 with a value,
the following will happen:
elastic will dynamically create and update the mapping and the new mapping will be:
mappings" : {
"my_event" : {
"properties" : {
"id" : {"type" : "string"},
"timestamp" : {"type" : "date"},
"EVENT1" : {"type" : "string"},

        }

}
And the same as you keep adding EVENT2 and so on dynamically.
each time you index a new event, it is stored as a separate "row" (document) which will have different fields every time.
so now you can use filter for my_event, for events which have event1 with value V39, and it return all associated events with the IDs that were in this event type.


(Brian Gow) #10

I stuck with using a nested structure and posted additional details about getting my data properly formatted and imported in these threads (1, 2, 3).


(system) #11