So, a follow-up question, because I neglected to explain my setup in my first question.
I have multiple CSV files, each with three fields. Each file is uniquely identifiable by field1, and the field1 value repeats for the entirety of that particular file. For example:
CSV1 with field1 = 1000:
field1|field2|field3
CSV2 with field1 = 2000:
field1|field2|field3
CSV3 with field1 = 3000:
field1|field2|field3
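To illustrate the "repeats" point, the first rows of CSV1 might look like this (the field2/field3 values are made up):

    field1|field2|field3
    1000|abc|123
    1000|def|456

so within any one file, field1 never changes.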
By adding if [field1] != "1000" { drop {} }, the pipeline will still read through all entries of all CSVs, correct? I ask because each CSV will have millions of rows, and reading a file that is not relevant could have a performance impact. In that case I would need to think of another solution.
That will drop any event where [field1] is not "1000", so it will drop every event from the second and third CSVs. But if you want to do that, why read them in the first place?
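For reference, here is that conditional in a complete filter block, with a csv filter parsing the pipe-delimited columns first (column names taken from your example):

    filter {
      csv {
        separator => "|"
        columns => ["field1", "field2", "field3"]
      }
      if [field1] != "1000" {
        drop {}    # every event from CSV2 and CSV3 is discarded here
      }
    }

The drop happens per event, after the line has already been read off disk and parsed, so Logstash still pays the I/O and parsing cost for every file.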
Sure: all files arrive in a shared directory, and we want the correct pipeline to ingest only the relevant file. For example, we don't want the pipeline for field1=1000 ingesting data from the CSV with field1=2000. At least, I am not aware of any way to check the value of field1 before ingesting.
And the way the source system is set up, we cannot have separate directories.
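In that case, each pipeline can watch the same shared directory and keep only its own events with a conditional like the one above. A minimal sketch of the input side, assuming the file input in read mode (the paths here are hypothetical):

    input {
      file {
        path => "/shared/incoming/*.csv"    # hypothetical shared drop directory
        mode => "read"                      # read each file to completion instead of tailing it
        file_completed_action => "log"      # the default, "delete", would remove files the other pipelines still need
        file_completed_log_path => "/var/log/logstash/done-1000.log"
      }
    }

The pipelines for field1=2000 and field1=3000 would be identical apart from the value in the conditional; each one still reads every file, which is the performance trade-off you described.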
Your suggestion worked in my test, and I did not see any performance issue.