ES logs errors with large initial logfiles

Hello,
I use fluentd, Elasticsearch 1.2.4 and Kibana 3.1.2 on an Ubuntu Trusty platform to analyze my logfiles.
There is an initial logfile that has to be taken into account.
While trying to get the stack working, I found out that when I start with an empty collection and put the logfile in the fluentd directory, Kibana shows only some of the records, or none at all.

This corresponds with error messages in the ES log, which says things like

[2016-02-10 09:38:25,344][DEBUG][action.search.type ] [Bling] [logstash-2016.02.08][1], node[cEvgxKytQbGKgsZxGZZXJQ], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@3ca82a1c] lastShard [true]
org.elasticsearch.search.SearchParseException: [logstash-2016.02.08][1]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"facets":{"0":{"date_histogram":{"field":"@timestamp","interval":"1h"},"global":true,"facet_filter":{"fquery":{"query":{"filtered":{"query":{"query_string":{"query":"*"}},"filter":{"bool":{"must":[{"range":{"@timestamp":{"from":1454488705089,"to":1455093505089}}}]}}}}}}}},"size":0}]]
at org.elasticsearch.search.SearchService.parseSource(SearchService.java:649)
at org.elasticsearch.search.SearchService.createContext(SearchService.java:511)
at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:483)
at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:252)
at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:206)
at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:203)
at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:517)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.search.facet.FacetPhaseExecutionException: Facet [0]: (key) field [@timestamp] not found
at org.elasticsearch.search.facet.datehistogram.DateHistogramFacetParser.parse(DateHistogramFacetParser.java:160)
at org.elasticsearch.search.facet.FacetParseElement.parse(FacetParseElement.java:93)
at org.elasticsearch.search.SearchService.parseSource(SearchService.java:633)
... 9 more

At first I thought my data was not OK, but all records are good. The situation now is that the full json logfile results in the logging above, but when I cut the file in two pieces and feed those to fluentd, neither Kibana nor Elasticsearch has any problems.

This COULD be a fluentd issue, or it could be an ES issue with processing large numbers of records, but I have not the slightest idea how to track this down. I would be very interested in seeing the data that causes ES to produce the stacktrace, but I have no clue how to get at it.
Has anyone run into a similar situation, or advice on what to do next?

Ruud

Caused by: org.elasticsearch.search.facet.FacetPhaseExecutionException: Facet [0]: (key) field [@timestamp] not found

It looks like your events don't have a @timestamp field, yet Kibana is configured to use that field as the timestamp field (I believe this was configurable in Kibana 3).

No, the data is OK. The split files, approximately 700 lines each, all contain a @timestamp field.
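
A quick way to verify this is a check along these lines (a rough Python sketch; the file name is just a placeholder for one of the split files):

import json

# Placeholder name for one of the split files; adjust to your setup.
with open("a22-part1.log") as f:
    for n, line in enumerate(f, 1):
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except ValueError as e:
            print("line %d: invalid JSON (%s)" % (n, e))
            continue
        if "@timestamp" not in record:
            print("line %d: missing @timestamp" % n)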

Ruud

I suspect either fluentd or ES has a problem with the size of the logfile. Maybe fluentd cuts the logfile into events in a way that makes ES receive invalid data, or Elasticsearch cannot process the large number of events.
I don't know how to intercept and inspect the records that ES receives. Any ideas?

tcpdump/Wireshark?

Hi Magnus, that was a good hint.

Using tcpdump, I listened on port 9200 and dumped all traffic to a file.
The problem is that the indexes of some days get corrupted and others do not.
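
For anyone who wants to do the same capture without wading through raw tcpdump output, something along these lines also works (a sketch using scapy, which has to be installed separately and run as root; tcpdump is what I actually used):

from scapy.all import Raw, sniff

# Print the raw payload of every packet to or from port 9200.
def show_payload(pkt):
    if Raw in pkt:
        print(pkt[Raw].load)

sniff(filter="tcp port 9200", prn=show_payload, store=False)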

In the dump, I could see that the data Elasticsearch receives from fluentd is cut up into packets, and those packet boundaries do not coincide with the json record boundaries, for example:

====== snip ======
17:05:06.524209 IP 172.22.25.227.33655 > 172.22.25.227.9200: Flags [P.], seq 175207:184691, ack 1, win 342, options [nop,nop,TS val 12919195 ecr 12919195], length 9484
E.%@..@.@............w#.e:B..U.d...V.%.....
..!...!."success":true,"path":"/data/factuur-logs/a22.log"}
{"@timestamp":"2016-01-05T17:10:00.000+00:00","url":"192.168.1.137","component":"pdfserver","factuurnummer":"21737122","een_klantnummer":"103652575","success":true,"path":"/data/factuur-logs/a22.log"}
: : :
{"@timestamp":"2015-12-18T13:09:00.000+00:00","url":"192.168.1.137","component":"pdfserver","factuurnummer":"21695819","een_klantnummer":"103641633","success":true,"path":"/data/factuur-logs/a22.log"}
{"@timestamp" <<<<= END OF RECORD HERE
17:05:06.521356 IP 172.22.25.227.9200 > 172.22.25.227.33655: Flags [.], ack 21991, win 1365, options [nop,nop,TS val 12919194 ecr 12919194], length 0
E..4..@.@...........#..w.U.de7.c...U.......
..!...!.
17:05:06.523966 IP 172.22.25.227.33655 > 172.22.25.227.9200: Flags [P.], seq 21991:43879, ack 1, win 342, options [nop,nop,TS val 12919195 ecr 12919194], length 21888
E.U...@.@..f.........w#.e7.c.U.d...V.......
..!...!.:"2015-12-18T13:09:00.000+00:00","url":"192.168.1.137","component":"pdfserver","factuurnummer":"21695819","een_klantnummer":"103641633","success":true,"path":"/data/factuur-logs/a22.log"}
{"@timestamp":"2015-12-18T13:09:00.000+00:00","url":"192.168.1.137","component":"pdfserver","factuurnummer":"21695010","een_klantnummer":"103569185","success":true,"path":"/data/factuur-logs/a22.log"}
====== /snip ======

It turns out that this by itself is not a problem: I could find all of the above records back, both in Kibana and with a ruby script.

However, problems arise when the packet boundary cuts a timestamp in two:

====== snip ======
{"@timestamp":"2015-12-22T11:09:00.000+00:00","url":"192.168.1.137","component":"pdfserver","factuurnummer":"21694829","een_klantnummer":"103414322","success":true,"path":"/data/factuur-logs/a22.log"}
{"@timestamp":"2015-12-22T11:21:00.000+00:00","url":"192.168.1.137","component":"pdfserver","factuurnummer":"21686466","een_klantnummer":"000630343","success":true,"path":"/data/factuur-logs/a22.log"}
{"@timestamp":"2015-12-22T11:21:0 <<<<< END OF RECORD HERE
17:05:06.524087 IP 172.22.25.227.33655 > 172.22.25.227.9200: Flags [P.], seq 65767:87655, ack 1, win 342, options [nop,nop,TS val 12919195 ecr 12919194], length 21888
E.U...@.@..d.........w#.e8.c.U.d...V.......
..!...!.0.000+00:00","url":"192.168.1.137","component":"pdfserver","factuurnummer":"21686466","een_klantnummer":"000630343","success":true,"path":"/data/factuur-logs/a22.log"}
{"@timestamp":"2015-12-22T11:26:00.000+00:00","url":"192.168.1.137","component":"pdfserver","factuurnummer":"21712552","een_klantnummer":"103648900","success":true,"path":"/data/factuur-logs/a22.log"}
{"@timestamp":"2015-12-22T11:30:00.000+00:00","url":"192.168.1.137","component":"pdfserver","factuurnummer":"21712552","een_klantnummer":"103648900","success":true,"path":"/data/factuur-logs/a22.log"}
====== /snip ======

In this case, none of the records from 2015-12-22 can be retrieved with either Kibana or the previously mentioned ruby script.
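
The kind of check I mean boils down to counting the documents in the daily index and comparing against the logfile. A minimal sketch in Python (standing in for the ruby script; the index name follows the logstash-YYYY.MM.DD pattern from the ES log above, host and port are assumed to be the defaults):

import json
from urllib.request import urlopen

# Count the documents in the daily index for 2015-12-22.
with urlopen("http://localhost:9200/logstash-2015.12.22/_count") as resp:
    body = json.loads(resp.read())
print("documents in logstash-2015.12.22:", body["count"])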

I have encountered the same kind of problem before, also when I used logstash. It only arises when we start with large initial files, which get chopped up; now and then Elasticsearch would 'forget' records. Now I understand that the chance that a json record is chopped up at a place ES doesn't like grows with the size of the logfile.

Could this problem be circumvented by proper ES configuration, or is this a bug that has to be fixed in the ES code?

Thanks in advance. Ruud

I just found out that every json record that is cut in two results in NO hit from ES. If the cut falls inside the timestamp, it results in a corrupted day index and none of the records for that day can be retrieved.

Ruud

It seems the size of the logfile is not so much the issue as the number of records ES gets via port 9200 in a short time.
When I split the initial 700-line logfile into two files of 350 lines and process both of them on initialization, the same error occurs.

When I feed the two files to fluentd one after the other, thus forcing fluentd not to overfeed ES, all records end up in the ES indexes.

Clearly, ES has too little time or memory to do its job without dropping records.
But now: how can I find out what I have to tune? The logfile in /var/log/elasticsearch says nothing about dropped records, which is not nice either. What can I do to make Elasticsearch do its job, and to be sure that when it does not, I get a notification so that I can take proper measures?

regards, Ruud

It turned out that records get rejected because the bulk queue size was too small. On my system, ES responds to that situation by rejecting the records that happen to fall on a tcp packet boundary.
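
For the record, the rejections are visible in the node statistics: the bulk thread pool has a rejected counter. A small sketch to read it (default host and port assumed):

import json
from urllib.request import urlopen

# Fetch per-node thread pool statistics and print the bulk pool counters.
with urlopen("http://localhost:9200/_nodes/stats/thread_pool") as resp:
    stats = json.loads(resp.read())

for node in stats["nodes"].values():
    bulk = node["thread_pool"]["bulk"]
    print(node["name"], "bulk queue:", bulk["queue"], "rejected:", bulk["rejected"])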

After changing the thread pool configuration in elasticsearch.yml with

threadpool.bulk.type: fixed
threadpool.bulk.size: 100
threadpool.bulk.queue_size: 2000

no more records were rejected.
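
As far as I can tell the same settings can also be applied at runtime through the cluster update settings API, which saves a restart. A sketch (I have not verified against the 1.2 documentation that these particular settings are dynamic, so treat this as an assumption):

import json
from urllib.request import Request, urlopen

# Apply the bulk thread pool settings as transient cluster settings.
body = json.dumps({
    "transient": {
        "threadpool.bulk.type": "fixed",
        "threadpool.bulk.size": 100,
        "threadpool.bulk.queue_size": 2000,
    }
}).encode()

req = Request("http://localhost:9200/_cluster/settings", data=body, method="PUT")
with urlopen(req) as resp:
    print(json.loads(resp.read()))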
I still find it difficult to accept that in this situation records are silently rejected, without any trace in the logging and without any hint of what is going on. But I am glad the situation is back to normal now.

Ruud