Ingesting JSON files, format problem?

CraigFoote · December 4, 2015, 6:35pm

I have a series of JSON files, each with one rather large, single-line JSON document. I'm trying to ingest via the file input plugin:

input{
	file{
		path => "/path/to/files/*.txt"
		start-position => "beginning"
		sincedb_path => "/dev/null"
		codec => "json"
	}
}
output{
	stdout{ codec => rubydebug }
}

I think there's something wrong with my json format because it works when I put a test.txt file in that folder with content:
{"test1":"test2"}

The series of json files I want to ingest are of the format:

{"messages":[{"text": ["The result was retrieved successfully."], "level": "INFO"}], "results":[{"field1": "value1", "field2": [{"field2a": "value2a", "field2b": "value2b"}], "field3": "value3"}]}

Is there a problem here? I don't get any errors and the files are found, but I see nothing from rubydebug. Could it be the length of the files? They are a couple of MB each.

vtst2412 · December 4, 2015, 7:26pm

There are a few things you could try:

Use the json filter instead of the codec
Add the --debug switch to see what is going on

CraigFoote · December 4, 2015, 7:53pm

Tried the json filter, same result. I have --debug turned on, so I see it finding the files, recognizing their "new" size, seeking to the beginning, using the /dev/null sincedb_path, etc. Apparently nothing is done with the files, though. Then I see the discovery polling every 15 seconds. I've waited minutes thinking it might just be busy, but nothing is happening :(

CraigFoote · December 4, 2015, 8:10pm

Seems it's a file length problem. I shortened one of the files by removing several of the "field2*" objects and it works.

{"messages":[{"text": ["The result was retrieved successfully."], "level": "INFO"}], "results":[{"field1": "value1", "field2": [{"field2a": "value2a", "field2b": "value2b"}], "field3": "value3"}]}

Is this a problem with the file input or the rubydebug? What can I do to ingest large files?

magnusbaeck · December 4, 2015, 8:12pm

Does your configuration file actually say start-position instead of start_position? Logstash 2.1 rejects such misspellings, but I thought you might be running something older that doesn't. Anyway, this works for me (LS 2.1):

$ cat data
{"messages":[{"text": ["The result was retrieved successfully."], "level": "INFO"}], "results":[{"field1": "value1", "field2": [{"field2a": "value2a", "field2b": "value2b"}], "field3": "value3"}]}
$ cat test.config
input {
  file {
    start_position => "beginning"
    path => "/tmp/trash.K06h/data"
    sincedb_path => "/dev/null"
    codec => "json"
  }
}
output { stdout { codec => rubydebug } }
$ /opt/logstash/bin/logstash -f test.config
Settings: Default filter workers: 1
Logstash startup completed
{
      "messages" => [
        [0] {
             "text" => [
                [0] "The result was retrieved successfully."
            ],
            "level" => "INFO"
        }
    ],
       "results" => [
        [0] {
            "field1" => "value1",
            "field2" => [
                [0] {
                    "field2a" => "value2a",
                    "field2b" => "value2b"
                }
            ],
            "field3" => "value3"
        }
    ],
      "@version" => "1",
    "@timestamp" => "2015-12-04T20:11:26.188Z",
          "host" => "hallonet",
          "path" => "/tmp/trash.K06h/data"
}
^CSIGINT received. Shutting down the pipeline. {:level=>:warn}
Logstash shutdown completed

CraigFoote · December 4, 2015, 8:18pm

I did use "start_position", not "start-position" as I typed. I hate working on a machine with no internet connection and retyping everything in these forums; I always make transcription errors.

Anyway, my {"hello": "world"} JSON document works and a shortened version of one of my JSON documents works, so I'm pretty sure it's the length of the documents. I changed discover_interval to 30 seconds but saw no change, and I also tried the plain codec with the json filter. Any other ideas? Could this be our underpowered cluster or an artifact of some Logstash piece?

CraigFoote · December 4, 2015, 8:24pm

I checked one of the JSON documents; there are over 5000 "field2*" objects.

vtst2412 · December 5, 2015, 1:10am

I've seen this issue before (Logstash recognizes the line but doesn't do anything with it, as if it's waiting for something) with the json filter; I haven't tried the json codec. It happened when the JSON line had no line terminator. I assume your JSON object is on one single line?

If you're on Linux, run file <filename>.txt and see what you get. If it reports no line terminators, add a newline ('\n') at the end of the file.
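
If you have many files to fix, the same check can be scripted. Here is a minimal Python sketch, assuming the files are small enough to open in place; ensure_trailing_newline is a hypothetical helper name and not part of Logstash. It looks only at the last byte of the file and appends '\n' when it is missing.

```python
import os


def ensure_trailing_newline(path):
    """Append '\n' to the file at `path` if its last byte is not a newline.

    Returns True if a newline was added, False if the file was left unchanged.
    Hypothetical helper for illustration; not part of Logstash.
    """
    # An empty file has no partial line to terminate.
    if os.path.getsize(path) == 0:
        return False
    with open(path, "rb+") as f:
        f.seek(-1, os.SEEK_END)   # position on the last byte
        if f.read(1) == b"\n":
            return False          # already terminated, nothing to do
        f.seek(0, os.SEEK_END)    # explicit seek before switching to writing
        f.write(b"\n")
        return True
```

Calling it on each file matched by your path pattern (for example via glob.glob) before starting Logstash should be enough; running it twice on the same file is harmless.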

CraigFoote · December 7, 2015, 2:24pm

Thanks Vincent, that was the problem.

Now I'm getting:

"error"=>{"type"=>"mapper_parsing_exception", "reason"=>"Merging dynamic updates triggered a conflict: mapper [results.typedValues.value] of different type, current_type [string], merged_type [date]"}

Looks like some of the JSON objects' values are strings and others are dates. Not sure what I can do about that... Any ideas?

vtst2412 · December 7, 2015, 2:57pm

That's correct. Most likely you have an array of objects (results.typedValues.value) with mismatching types. ES expects fields with the same name in objects belonging to the same array to have matching types (i.e. if the very first object of the array has the field value with type string, any subsequent object that contains the field value needs to map that field to the string type).

One technique I've used in the past to flatten out this kind of imperfect array was using the json filter or the ruby filter to turn the array into an object of objects (a hash of hashes).
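
To illustrate the idea outside Logstash, here is a minimal Python sketch of that transformation. It is not the actual filter: the sample document is made up, its structure is only guessed from the field name in the error message, and array_to_hash and the "item" key prefix are hypothetical choices.

```python
import json


def array_to_hash(items, prefix="item"):
    """Turn a list of objects into an object of objects keyed by position.

    Hypothetical helper; the key scheme (prefix + index) is an assumption.
    """
    return {f"{prefix}{i}": item for i, item in enumerate(items)}


# Made-up sample: two objects whose "value" fields look like different types.
doc = json.loads(
    '{"results": [{"typedValues": ['
    '{"value": "2015-12-04"}, {"value": "not a date"}]}]}'
)

# Replace each jagged array with a hash so every element gets its own path.
for result in doc["results"]:
    result["typedValues"] = array_to_hash(result["typedValues"])

print(json.dumps(doc))
```

After this, the two values live under separate paths (results.typedValues.item0.value and results.typedValues.item1.value), so they no longer have to share one type. The trade-off is that every key becomes its own field in the mapping, which can get large for arrays with thousands of elements.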

CraigFoote · December 9, 2015, 2:39pm

I think you're right Vincent. Can you provide more details on how you used the JSON filter to transform the array of objects please?

vtst2412 · December 9, 2015, 5:16pm

Why don't you give me an example of your jagged array, and I will create a sample filter to handle it.

CraigFoote · December 9, 2015, 6:02pm

That would be difficult because our data is under security constraints. The best I can do is the pseudo-snippet I used above:

{"messages":[{"text": ["The result was retrieved successfully."], "level": "INFO"}], "results":[{"field1": "value1", "field2": [{"field2a": "value2a", "field2b": "value2b"}], "field3": "value3"}]}

Here, "value2a" would be a date but "value2b" would just be a string.
