Ingesting JSON files, format problem?


(Craig Foote) #1

I have a series of JSON files, each with one rather large, single-line JSON document. I'm trying to ingest via the file input plugin:

input{
	file{
		path => "/path/to/files/*.txt"
		start-position => "beginning"
		sincedb_path => "/dev/null"
		codec => "json"
	}
}
output{
	stdout{ codec => rubydebug }
}

I think there's something wrong with my json format because it works when I put a test.txt file in that folder with content:
{"test1":"test2"}

The series of json files I want to ingest are of the format:

{"messages":[{"text": ["The result was retrieved successfully."], "level": "INFO"}], "results":[{"field1": "value1", "field2": [{"field2a": "value2a", "field2b": "value2b"}], "field3": "value3"}]}

Is there a problem here? I don't get any errors and the files are found, but I see nothing from rubydebug. Could it be the length of the files? They are a couple of MB each.


(Vincent Tran) #2

There are a few things you could try:

  1. Use the json filter instead of codec
  2. Add --debug switch to see what is going on

(Craig Foote) #3

Tried the json filter, same result. I have --debug turned on so I see it finding the files and recognizing their "new" size, seeking beginning, /dev/null sincedb_path, etc. Just nothing done with the files apparently. Then I see the discovery polling every 15 seconds. I've waited minutes thinking it just might be busy but nothing happening :(.


(Craig Foote) #4

Seems it's a file length problem. I shortened one of the files by removing several of the "field2*" and it works.

{"messages":[{"text": ["The result was retrieved successfully."], "level": "INFO"}], "results":[{"field1": "value1", "field2": [{"field2a": "value2a", "field2b": "value2b"}], "field3": "value3"}]}

Is this a problem with the file input or the rubydebug? What can I do to ingest large files?


(Magnus Bäck) #5

Does your configuration file actually say start-position instead of start_position? Logstash 2.1 rejects such misspellings, but I thought you might be running something older that doesn't. Anyway, this works for me (LS 2.1):

$ cat data
{"messages":[{"text": ["The result was retrieved successfully."], "level": "INFO"}], "results":[{"field1": "value1", "field2": [{"field2a": "value2a", "field2b": "value2b"}], "field3": "value3"}]}
$ cat test.config
input {
  file {
    start_position => "beginning"
    path => "/tmp/trash.K06h/data"
    sincedb_path => "/dev/null"
    codec => "json"
  }
}
output { stdout { codec => rubydebug } }
$ /opt/logstash/bin/logstash -f test.config
Settings: Default filter workers: 1
Logstash startup completed
{
      "messages" => [
        [0] {
             "text" => [
                [0] "The result was retrieved successfully."
            ],
            "level" => "INFO"
        }
    ],
       "results" => [
        [0] {
            "field1" => "value1",
            "field2" => [
                [0] {
                    "field2a" => "value2a",
                    "field2b" => "value2b"
                }
            ],
            "field3" => "value3"
        }
    ],
      "@version" => "1",
    "@timestamp" => "2015-12-04T20:11:26.188Z",
          "host" => "hallonet",
          "path" => "/tmp/trash.K06h/data"
}
^CSIGINT received. Shutting down the pipeline. {:level=>:warn}
Logstash shutdown completed

(Craig Foote) #6

I did use "start_position", not "start-position" as I typed. I hate working on a non-internet connected machine and retyping everything in these forums, I always make transcription errors :(

Anyway, my {"hello": "world"} json document works and a shortened version of one of my json documents works so I'm pretty sure it's the length of the documents. I changed discover_interval to 30 seconds but no change and I tried using plain codec with json filter. Any other ideas? Could this be our underpowered cluster or an artifact of some logstash piece?


(Craig Foote) #7

I checked one of the json documents, there are over 5000 "field2*" objects.


(Vincent Tran) #8

I've seen this issue in the past with the json filter (I haven't tried the json codec): Logstash recognizes the line but doesn't do anything with it, as if it's waiting for something. It happened when the JSON line had no line terminator. I assume your JSON object is on one single line here?

If you're on Linux, run file <filename>.txt and see what you get. If it reports no line terminator, add a newline ('\n') at the end of the file.
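
If you have many files to fix, the check can be scripted. Below is a minimal Ruby sketch, assuming the files are small enough to open normally and that appending a single "\n" is all that is needed. The helper name ensure_trailing_newline is made up for illustration; it is not part of Logstash.

```ruby
# Hypothetical helper (not a Logstash API): appends "\n" to a file whose
# last byte is not already a newline. Returns true if the file was modified.
def ensure_trailing_newline(path)
  # An empty file has no last byte to inspect; leave it alone.
  return false if File.zero?(path)

  # Read only the final byte, so multi-MB files are not loaded into memory.
  last_byte = File.open(path, "rb") do |f|
    f.seek(-1, IO::SEEK_END)
    f.read(1)
  end
  return false if last_byte == "\n"

  # Append the missing line terminator.
  File.open(path, "ab") { |f| f.write("\n") }
  true
end
```

Calling it once per file before starting Logstash should be enough, since running it again on an already terminated file changes nothing.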


(Craig Foote) #9

Thanks Vincent, that was the problem.

Now I'm getting:

"error"=>{"type"=>"mapper_parsing_exception", "reason"=>"Merging dynamic updates triggered a conflict: mapper [results.typedValues.value] of different type, current_type [string], merged_type [date]"}

Looks like some of the JSON objects' values are strings and others are dates. Not sure what I can do about that... Any ideas?


(Vincent Tran) #10

That's correct. Most likely you have an array of objects (results.typedValues.value) with mismatching types. ES expects fields with the same name in objects belonging to the same array to have matching types (i.e. if the very first object of the array has a field value of type string, any subsequent object that contains the field value needs to map that field to the string type).

One technique I've used in the past to flatten out this kind of imperfect array is to use the json filter or the ruby filter to turn the array into an object of objects (a hash of hashes).
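
A rough sketch of that idea in plain Ruby, outside of Logstash. It assumes typedValues is an array of objects nested under each entry of results, which is only a guess based on the field name in the error message, and it keys each element by its position in the array. The helper name array_to_hash and the sample document are made up for illustration.

```ruby
require "json"

# Hypothetical helper: turns an array of objects into a hash keyed by position,
# e.g. [{"a"=>1}, {"a"=>"x"}] becomes {"0"=>{"a"=>1}, "1"=>{"a"=>"x"}}.
def array_to_hash(items)
  items.each_with_index.map { |item, index| [index.to_s, item] }.to_h
end

# Made-up sample document; the real structure is an assumption.
doc = JSON.parse(
  '{"results":[{"typedValues":[{"value":"2015-12-04"},{"value":"plain text"}]}]}'
)

# Replace each jagged array with a hash so every element gets its own field path.
doc["results"].each do |result|
  result["typedValues"] = array_to_hash(result["typedValues"])
end

puts JSON.generate(doc)
```

After the conversion the two values live under typedValues.0.value and typedValues.1.value, so they no longer share one mapping. The trade-off is one set of fields per array element, which could make the mapping very large for arrays with thousands of entries. Inside a ruby filter the same logic would apply to the event's fields, but the way you read and write event fields depends on your Logstash version.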


(Craig Foote) #11

I think you're right Vincent. Can you provide more details on how you used the JSON filter to transform the array of objects please?


(Vincent Tran) #12

Why don't you give me an example of your jagged array, and I will create a sample filter to handle it.


(Craig Foote) #13

That would be difficult because our data is under security constraints. The best I can do is the pseudo-snippet I used above:

{"messages":[{"text": ["The result was retrieved successfully."],
"level": "INFO"}], "results":[{"field1": "value1", "field2":
[{"field2a": "value2a", "field2b": "value2b"}], "field3": "value3"}]}

Here, "value2a" would be a date but "value2b" would just be a string.

