Elastic Cloud hosting
Elasticsearch 7.10.0
Filebeat 7.10.1
We've been unable to get the S3 input to apply its configured multiline options to the files it reads. Despite our best efforts, lines that should be consolidated into a single event are being sent to Elasticsearch as individual events.
S3 input section of filebeat.yml
filebeat.inputs:
- type: s3
  queue_url: "${QUEUE_URL}"
  multiline.type: pattern
  multiline.pattern: '^\d{4}-\d{2}-\d{2}'
  multiline.negate: true
  multiline.match: after
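To rule out the regex itself, here is a small standalone Python check (not part of any Filebeat config; the two sample lines are abbreviated from the log further down) that applies the same pattern:

import re

# Same pattern as in the multiline config: an event starts with a YYYY-MM-DD timestamp.
pattern = re.compile(r'^\d{4}-\d{2}-\d{2}')

print(bool(pattern.match("2020-12-28 11:10:44,919 INFO org.apache.hadoop.mapreduce.Job (main): Counters: 54")))  # True  -> starts a new event
print(bool(pattern.match("File System Counters")))  # False -> should be appended to the previous event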
When testing Filebeat locally with a basic log input and file output, the same multiline configuration works as expected.
filebeat.yml
filebeat.inputs:
- type: log
  paths:
    - /PATH/local_filebeat_logs/*
  multiline.type: pattern
  multiline.pattern: '^\d{4}-\d{2}-\d{2}'
  multiline.negate: true
  multiline.match: after
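The file output used in the local test isn't shown above; it was just Filebeat's standard output.file, roughly like this (the path and filename are placeholders, not the actual values used):

output.file:
  # Placeholder path/filename; the local test only needs events written to disk for inspection.
  path: "/PATH/local_filebeat_output"
  filename: "filebeat_output.ndjson"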
Sample input file
2020-12-28 11:10:19,800 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): S3DistCp args: --s3Endpoint=s3.us-east-1.amazonaws.com --src=hdfs:///date=20201228/hour=09 --dest=s3://BUCKET/PATH/date=20201228/hour=09/1609171752178
2020-12-28 11:10:44,825 INFO org.apache.hadoop.mapreduce.Job (main): Job job_1609171623625_0002 completed successfully
2020-12-28 11:10:44,919 INFO org.apache.hadoop.mapreduce.Job (main): Counters: 54
File System Counters
FILE: Number of bytes read=1935
FILE: Number of bytes written=1419609
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=3425
HDFS: Number of bytes written=0
HDFS: Number of read operations=42
HDFS: Number of large read operations=0
HDFS: Number of write operations=14
S3: Number of bytes read=0
S3: Number of bytes written=296
S3: Number of read operations=0
S3: Number of large read operations=0
S3: Number of write operations=0
Job Counters
Launched map tasks=1
Launched reduce tasks=7
Rack-local map tasks=1
Total time spent by all maps in occupied slots (ms)=192576
Total time spent by all reduces in occupied slots (ms)=7959168
Total time spent by all map tasks (ms)=2006
Total time spent by all reduce tasks (ms)=41454
Total vcore-milliseconds taken by all map tasks=2006
Total vcore-milliseconds taken by all reduce tasks=41454
Total megabyte-milliseconds taken by all map tasks=6162432
Total megabyte-milliseconds taken by all reduce tasks=254693376
Map-Reduce Framework
Map input records=9
Map output records=9
Map output bytes=3401
Map output materialized bytes=1907
Input split bytes=154
Combine input records=0
Combine output records=0
Reduce input groups=9
Reduce shuffle bytes=1907
Reduce input records=9
Reduce output records=0
Spilled Records=18
Shuffled Maps =7
Failed Shuffles=0
Merged Map outputs=7
GC time elapsed (ms)=1440
CPU time spent (ms)=42960
Physical memory (bytes) snapshot=3794964480
Virtual memory (bytes) snapshot=55973384192
Total committed heap usage (bytes)=5106565120
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=2975
File Output Format Counters
Bytes Written=0
2020-12-28 11:10:44,920 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): Try to recursively delete hdfs:/tmp/2c6f4478-befb-49ad-babe-f9f2e8e4f6e0
The end result for this sample file should be 4 events in Elasticsearch (one per timestamped line).
The only other difference worth mentioning is that the sample input log file is gzipped (.gz) in S3. The S3 input clearly has no trouble decompressing the .gz and reading its lines, but it still isn't applying the multiline configuration as desired.
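For reference, here is a rough standalone Python sketch (the local file name is just an example) of the grouping we expect negate: true / match: after to produce, reading a gzipped copy of the sample directly; it yields 4 events:

import gzip
import re

# Lines that begin with a YYYY-MM-DD timestamp start a new event.
start_of_event = re.compile(r'^\d{4}-\d{2}-\d{2}')

events = []
# Example local path; in our setup the object is actually fetched from S3 by the s3 input.
with gzip.open("sample_input.log.gz", "rt") as handle:
    for line in handle:
        if start_of_event.match(line):
            # Timestamped line -> begin a new event.
            events.append(line.rstrip("\n"))
        elif events:
            # negate: true + match: after -> non-matching lines are appended
            # to the event opened by the most recent timestamped line.
            events[-1] += "\n" + line.rstrip("\n")

print(len(events))  # expected: 4 for the sample above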