Logstash mutate filter, rename - "Exception in filterworker", "exception"=>#<IndexError: string not matched>


(Brian Gow) #1

I'm trying to import CSV files into a nested structure in Elasticsearch (v.1.7.3), on a Windows 7 Professional machine. When I use the mutate filter with rename in Logstash (v.1.5.4) I get the following

{:timestamp=>"2015-10-26T20:15:34.837000-0400", :message=>"Exception in filterworker", "exception"=>#<IndexError: string not matched>, "backtrace"=>["org/jruby/RubyString.java:3912:in `[]='", "C:/users/bgow/documents/MIMIC/III/ElasticSearch/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.4-java/lib/logstash/util/accessors.rb:64:in `set'", "C:/users/bgow/documents/MIMIC/III/ElasticSearch/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.4-java/lib/logstash/event.rb:146:in `[]='", "C:/users/bgow/documents/MIMIC/III/ElasticSearch/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-filter-mutate-1.0.1/lib/logstash/filters/mutate.rb:240:in `rename'", "org/jruby/RubyHash.java:1341:in `each'", "C:/users/bgow/documents/MIMIC/III/ElasticSearch/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-filter-mutate-1.0.1/lib/logstash/filters/mutate.rb:238:in `rename'", "C:/users/bgow/documents/MIMIC/III/ElasticSearch/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-filter-mutate-1.0.1/lib/logstash/filters/mutate.rb:211:in `filter'", "C:/users/bgow/documents/MIMIC/III/ElasticSearch/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.4-java/lib/logstash/filters/base.rb:163:in `multi_filter'", "org/jruby/RubyArray.java:1613:in `each'", "C:/users/bgow/documents/MIMIC/III/ElasticSearch/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.4-java/lib/logstash/filters/base.rb:160:in `multi_filter'", "(eval):71:in `filter_func'", "C:/users/bgow/documents/MIMIC/III/ElasticSearch/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.4-java/lib/logstash/pipeline.rb:219:in `filterworker'", "C:/users/bgow/documents/MIMIC/III/ElasticSearch/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.4-java/lib/logstash/pipeline.rb:157:in `start_filters'"], :level=>:error, :file=>"/users/bgow/documents/MIMIC/III/ElasticSearch/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.4-java/lib/logstash/pipeline.rb", :line=>"231", 
:method=>"filterworker"}

error message (with verbosity turned on).

The filters in my .conf look like:

filter {
  csv {
    columns   => ["ID","AGE","GENDER","WAVE"]
    separator => ","
  }
}

filter {
  mutate {
    rename => [ "AGE", "[ID][AGE]" ]
  }
}

When I use the same .conf without the mutate filter I don't get any errors but obviously don't get a nested structure as a result.

I found this bug which seems to indicate that there is a similar issue when using mutate in Logstash with numeric fields. However, I am still seeing the error even when importing only strings. I even added characters in front of any numbers to make sure Logstash wasn't interpreting the strings as numeric, but I still get the filterworker exception.

Do you think this is the same bug as the one listed or is it possible that I'm doing something wrong?


(Magnus Bäck) #2

You're trying to make the AGE field a subfield of the ID field, but that field is a string, so it can't contain subfields. Think of Logstash events as JSON objects: the values in a JSON object are either objects, arrays, or scalars, and only objects can contain other values (subfields in Logstash-speak).

If you describe what you want to accomplish (i.e. what you want the events to look like in the end) we can help you get there, but right now it looks like you're trying to do something that doesn't fit Logstash's data model.
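For illustration, a rename into a subfield does work when the target's parent field doesn't already exist as a string — Logstash then creates the parent as an object. A sketch (the "demographics" field name here is hypothetical):

```
filter {
  csv {
    columns   => ["ID","AGE","GENDER","WAVE"]
    separator => ","
  }
  mutate {
    # "demographics" does not exist yet, so Logstash creates it as an
    # object and AGE becomes a subfield of it. Renaming into [ID][AGE]
    # fails because ID is already a string value.
    rename => [ "AGE", "[demographics][AGE]" ]
  }
}
```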


(Brian Gow) #3

Ok, that makes sense. I have posted details about what I'm trying to accomplish here. A simplified version of that would be:

Take CSV files with this kind of structure:

file1:

id,age,gender,wave
1,49,M,1
2,72,F,0

file2:

id,time,event1
1,4/20/2095,V39
1,4/21/2095,T21
2,5/17/2094,V39

file3:

id,time,event2
1,4/22/2095,P90
2,5/18/2094,E2

and create an Elasticsearch index where "id" is the root/parent, with each "file#" nested under a given "id". There should be only one "id" object in the output JSON even though there are multiple files for each "id", and often multiple rows per "id" in a given file. When entering this manually I use a mapping to define the nested "file#" fields and set the property fields to "not_analyzed". For completeness, this is the resulting index structure I'm after:
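A mapping along those lines might look like this (a sketch for Elasticsearch 1.x syntax; index, type, and field names are taken from the sample data above):

```
PUT /forumlogst
{
  "mappings": {
    "subject": {
      "properties": {
        "id": { "type": "string", "index": "not_analyzed" },
        "file1": {
          "type": "nested",
          "properties": {
            "age":    { "type": "string", "index": "not_analyzed" },
            "gender": { "type": "string", "index": "not_analyzed" },
            "wave":   { "type": "string", "index": "not_analyzed" }
          }
        },
        "file2": {
          "type": "nested",
          "properties": {
            "time":   { "type": "string", "index": "not_analyzed" },
            "event1": { "type": "string", "index": "not_analyzed" }
          }
        },
        "file3": {
          "type": "nested",
          "properties": {
            "time":   { "type": "string", "index": "not_analyzed" },
            "event2": { "type": "string", "index": "not_analyzed" }
          }
        }
      }
    }
  }
}
```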

GET /forumlogst/subject/_search
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 1,
      "hits": [
         {
            "_index": "forumlogst",
            "_type": "subject",
            "_id": "1",
            "_score": 1,
            "_source": {
               "id": "1",
               "file1": [
                  {
                     "age": "49",
                     "gender": "M",
                     "wave": "1"
                  }
               ],
               "file2": [
                  {
                     "time": "04/20/2095",
                     "event1": "V39"
                  },
                  {
                     "time": "04/21/2095",
                     "event1": "T21"
                  }
               ],
               "file3": [
                  {
                     "time": "04/22/2095",
                     "event2": "P90"
                  }
               ]
            }
         },
         {
            "_index": "forumlogst",
            "_type": "subject",
            "_id": "2",
            "_score": 1,
            "_source": {
               "id": "2",
               "file1": [
                  {
                     "age": "72",
                     "gender": "F",
                     "wave": "0"
                  }
               ],
               "file2": [
                  {
                     "time": "05/17/2094",
                     "event1": "V39"
                  }
               ],
               "file3": [
                  {
                     "time": "04/22/2095",
                     "event2": "E2"
                  }
               ]
            }
         }
      ]
   }
}

I'm hoping Logstash can help automate the creation of this kind of index, as I will have thousands of "id"s and hundreds of event types. Each "id" may also have a couple hundred events of a particular type (e.g. "id" 3 may have 250 event2 rows in its CSV). I realize this will create a very large JSON file/index. I'm choosing this nested approach over parent/child structures primarily because we want Kibana support for aggregation across nested fields, which is planned for v4.4. I welcome any advice though.


Porting a relational structure to Elasticsearch - Nested or Parent/Child
(Magnus Bäck) #4

This kind of aggregation of input events (or, perhaps, incremental updates of documents in ES) is something Logstash isn't very good at. I'm not saying it's impossible but I'd look into using something else for this.


(Brian Gow) #5

Thanks. Do you have any recommendations for what else to use? I wrote a script in Matlab to take my CSV files and parse them into the format I want for use with the bulk API:

{"index":{"_index":"forum_mat","_type":"subject","_id":"1"}}
{"id":"1","file1":[{"filen":"file1","id":"1","age":"49","gender":"M","wave":"1"}],"file2":[{"filen":"file2","id":"1","time":"4/20/2095","event1":"V39"},{"filen":"file2","id":"1","time":"4/21/2095","event1":"T21"}]}
{"index":{"_index":"forum_mat","_type":"subject","_id":"2"}}
{"id":"2","file1":[{"filen":"file1","id":"2","age":"72","gender":"F","wave":"0"}],"file2":[{"filen":"file2","id":"2","time":"5/17/2094","event1":"V39"},{"filen":"file2","id":"2","time":"5/18/2094","event1":"R9"},{"filen":"file2","id":"2","time":"5/20/2094","event1":"Q20"}]}

This appears to work just fine but, as I feared, it is much too slow. Completing this run on all 40+ GB of my data might take months. Is there something you would recommend that is efficient at parsing and aggregating data (by id in this case)?
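For reference, this kind of per-id aggregation can be sketched in a few lines of a general-purpose scripting language. A minimal sketch in Python (the function name, file paths, and index/type defaults are illustrative assumptions, not part of the original scripts):

```python
import csv
import json
from collections import defaultdict

def build_bulk(files, index="forum_mat", doc_type="subject"):
    """Group CSV rows by id and emit Elasticsearch bulk-API lines.

    `files` maps a field name (e.g. "file1") to the path of a CSV
    whose rows carry an `id` column.
    """
    docs = defaultdict(dict)
    for name, path in files.items():
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                doc_id = row.pop("id")          # group key
                doc = docs[doc_id]
                doc["id"] = doc_id
                doc.setdefault(name, []).append(row)  # nest row under file name
    lines = []
    for doc_id, source in docs.items():
        # One action line plus one source line per id, bulk-API style.
        lines.append(json.dumps({"index": {"_index": index,
                                           "_type": doc_type,
                                           "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"
```

Since everything is grouped in memory by id before serialization, each id ends up as a single document regardless of how many files and rows mention it.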

Alternatively, is there another solution besides striving for a single JSON file with all the information across my files nested under a given id?

I need to be able to perform searches that are row/document aware and can also aggregate by id across files. For example, with the sample data here I might need to find any ids that have wave=1 in file1 and, in file2, have event1=V39 in one row/document and also event1=T21 with time=4/21/2095 (in their own row/document). I also want to be able to use Kibana to visualize my data across files and rows. For example (with this small dataset) I might want a pie chart of those with wave=1 and 0, aggregated by event1 and sub-aggregated by event2. With the nested structure I mention above this works in the Kibana development branch (and is scheduled to be released in v4.4). Please let me know if there is a more straightforward way to accomplish this with the ELK stack.
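As a sketch, the first search described above could be expressed with nested queries (assuming the file1/file2 fields are mapped as nested, not_analyzed strings; ES 1.x query DSL):

```
GET /forumlogst/subject/_search
{
  "query": {
    "bool": {
      "must": [
        { "nested": { "path": "file1",
                      "query": { "term": { "file1.wave": "1" } } } },
        { "nested": { "path": "file2",
                      "query": { "term": { "file2.event1": "V39" } } } },
        { "nested": { "path": "file2",
                      "query": { "bool": { "must": [
                        { "term": { "file2.event1": "T21" } },
                        { "term": { "file2.time": "4/21/2095" } }
                      ] } } } }
      ]
    }
  }
}
```

Each nested clause matches within a single nested row/document, which is what makes the event1=T21 and time=4/21/2095 conditions apply to the same row.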


(Brian Gow) #6

I ended up using Postgres to export JSON in the required format for the bulk API. Additional details can be found in these threads (1, 2).
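A query along those lines might look like the following sketch (assuming Postgres 9.4+ for the JSON functions, and tables named file1/file2 keyed by an id column — the names are assumptions, not the actual schema):

```
-- Build one bulk-API source document per id by aggregating
-- each file's rows into a JSON array.
SELECT json_build_object(
         'id', s.id,
         'file1', (SELECT json_agg(json_build_object(
                            'age', f1.age,
                            'gender', f1.gender,
                            'wave', f1.wave))
                   FROM file1 f1 WHERE f1.id = s.id),
         'file2', (SELECT json_agg(json_build_object(
                            'time', f2.time,
                            'event1', f2.event1))
                   FROM file2 f2 WHERE f2.id = s.id)
       )
FROM (SELECT DISTINCT id FROM file1) s;
```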

