I am retrieving indices containing syslog messages from an Elasticsearch cluster through a curl command, and the results are then read by Logstash.
I want to extract 4 fields from these messages (datetime_received, ip_host_pkt, source_msg, _type) and rename them to date, host, message, type.
The Elasticsearch source is secured and not under my authority; the only way to access it is through a proxy API, and only by issuing a POST command to retrieve an authentication token before launching the curl GET command (roughly as in the sketch below).
I also have a full Logstash conf ready, but it expects syslog in CSV format.
Now I am getting a JSON file as input, so I want to read the multiline JSON, extract the 4 fields (date, msg, host, type) and transform them to CSV so I can process them with my existing Logstash CSV conf.
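Roughly, the access works like this. This is only a sketch to show the shape of the flow: the proxy URL, the token endpoint, the credentials, the name of the token field in the response and the Authorization header format are placeholders, not my real values.

# Placeholder proxy endpoint and credentials - not the real ones.
TOKEN=$(curl -s -X POST "https://proxy.example.com/api/auth" \
  -H 'Content-Type: application/json' \
  -d '{"username":"USER","password":"PASS"}' \
  | sed -e 's/.*"token":"\([^"]*\)".*/\1/')

# Then the GET against the index goes through the proxy with the token.
curl -s -X GET "https://proxy.example.com/es/index-2018-08-09/_search" \
  -H "Authorization: Bearer $TOKEN"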
If all you need to do is rename and remove fields, you can do this inside Elasticsearch without extracting the documents, using Ingest operations. Here is an example based on the field names in your post:
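(A minimal sketch: the pipeline name is arbitrary, the remove processor uses a placeholder field name, and _type is index metadata, so it is not a _source field that a rename processor can touch.)

curl -X PUT "http://0.0.0.0:9200/_ingest/pipeline/rename_syslog_fields" \
  -H 'Content-Type: application/json' -d '
{
  "description": "Rename syslog fields at ingest time",
  "processors": [
    { "rename": { "field": "datetime_received", "target_field": "date" } },
    { "rename": { "field": "ip_host_pkt", "target_field": "host" } },
    { "rename": { "field": "source_msg", "target_field": "message" } },
    { "remove": { "field": "some_unwanted_field" } }
  ]
}'

Documents indexed with ?pipeline=rename_syslog_fields then carry the renamed fields; for documents already in the index you would need a _reindex or _update_by_query that applies the same pipeline.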
My problem is that the format of the data source has changed from CSV to JSON, and I want to lean on the development already done in the Logstash CSV conf to enrich the data (geo, etc.).
So we can't bypass Logstash.
I have syslogs in JSON format (an Elasticsearch extract) and need to retrieve the 4 fields, convert them to CSV and process that CSV with all the Logstash filters already set up.
Thanks @moughrom. My question is: how is the JSON extracted from Elasticsearch? I understand curl is involved, but I do not understand what the request is, and this makes a difference as there are many options. Could you please give an example of the request (with the auth token redacted)? Thanks.
OK, that makes sense. Are the curl commands issued by a bash script, and/or is there any other scripted processing before the raw Elasticsearch query output is read by Logstash?
That's great. Logstash has codecs to deal with multiline messages and with JSON; however, chaining codecs is not supported (the reasons are documented on GitHub). As you are using a bash script, this gives us an opportunity to process the extracted JSON, which contains multiple documents as well as metadata for the query, into a file of newline-delimited JSON with one line per document.
curl -X GET http://0.0.0.0:9200/index-2018-08-09/_search |
  sed -e "s/^.*_source\"://" |
  grep -v "}]}}" |
  awk '/^}/ {print (NR==1?"":RS)$0;next} {printf "%s",$0}' |
  grep -v "^}$" |
  sed -e "s/$/\}/" > /Users/Shared/logs/test.json
You can add your additional curl parameters and substitute your own Elasticsearch instance.
Then you can use Logstash to ingest this using the JSON codec. Use the json codec, not json_lines.
input {
  # Read all documents from Elasticsearch matching the given query
  file {
    path => "/Users/Shared/logs/*.json"
    # Read existing file content from the beginning instead of tailing it
    start_position => "beginning"
    codec => json
  }
}
Logstash filters can then rename and remove fields, and so on. You can add any extra filters you already have here.
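A rough sketch of the renaming part, assuming the field names from your first post (the remove_field entry is a placeholder):

filter {
  mutate {
    # Rename the extracted fields to what the existing CSV conf expects
    rename => {
      "datetime_received" => "date"
      "ip_host_pkt"       => "host"
      "source_msg"        => "message"
    }
    # Drop fields not needed downstream (placeholder name)
    remove_field => [ "some_unwanted_field" ]
  }
}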
I am only getting one syslog message in the JSON output file. I changed > to >> to concatenate, but I am still getting only one message; at most I get 3 messages when I use -d '{"from":0,"size":3000}'.
curl -X GET http://0.0.0.0:9200/index-2018-08-09/_search |
  sed -e "s/^.*_source\"://" |
  grep -v "}]}}" |
  awk '/^}/ {print (NR==1?"":RS)$0;next} {printf "%s",$0}' |
  grep -v "^}$" |
  sed -e "s/$/\}/" >> /Users/Shared/logs/test.json
I don't have the resources to fully replicate your process offline.
Here are some basic pointers:
The simplest way to "fix" Elasticsearch data is inside the cluster using Painless. You could do this using appropriate requests and the auth tokens, with the limitation that you would not have the added value of the Logstash geoip lookup. However, all the other field operations in your requirements could be performed simply and reliably inside the cluster without extracting the data.
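For example, a minimal sketch of an in-cluster rename with Painless via _update_by_query (this assumes the field names from the original post and that your token gives you update rights on the index; _type is metadata and is left alone):

curl -X POST "http://0.0.0.0:9200/index-2018-08-09/_update_by_query" \
  -H 'Content-Type: application/json' -d '
{
  "script": {
    "lang": "painless",
    "source": "ctx._source.date = ctx._source.remove(\"datetime_received\"); ctx._source.host = ctx._source.remove(\"ip_host_pkt\"); ctx._source.message = ctx._source.remove(\"source_msg\");"
  }
}'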
If you do need to pull the data out, you may need to do some parsing before resubmitting to Elasticsearch. I would choose the technology that you are most comfortable with to do this.
From your replies above ^^, it looks like AWK is not matching your pattern. You could fix this by looking more into AWK; however, I would advise against the append operation ">>", as you don't want to write the same documents multiple times to the file.
The reason I suggested AWK was to avoid adding more technology; however, there are many third-party open-source scripting tools that you could use to process the data into a format that is ready for ingest via Logstash. I would suggest going with whichever you are most experienced in.
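For example, if jq is an option, the whole sed/grep/awk chain could be replaced by one filter that emits newline-delimited _source documents (a sketch, using the same placeholder host and index as above):

curl -s -X GET "http://0.0.0.0:9200/index-2018-08-09/_search?size=3000" \
  | jq -c '.hits.hits[]._source' > /Users/Shared/logs/test.json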