I am new to Elasticsearch and want to index my website logs, which are stored on HDFS, for fast querying. I have a well-structured pipeline that runs a script every 20 minutes to ingest the data into HDFS. I want to integrate Elasticsearch with it so that it also indexes these logs based on particular field(s), thereby giving faster query results with Spark SQL. So my question is: can I index my data based on particular field(s) only? Also, my logs are saved in the Avro file format. Does ES provide a way to directly index Avro-serialized data, or do I need to convert it into some other format?
Of course. Take a look at the docs. Note that ES thinks in terms of documents rather than fields (which is more of an RDBMS concept). In other words, you can simply throw the documents at it and be done with it. The ES documentation explains the various indexing options you have, including mappings (which it looks like you need).
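If you want control over how particular fields are indexed, you can define an explicit mapping before loading any data. Here's a minimal sketch against the index-creation REST API (1.x/2.x mapping syntax); the index name `weblogs`, the type `log`, and the field names are placeholders for whatever your logs actually contain:

```
curl -XPUT 'http://localhost:9200/weblogs' -d '{
  "mappings": {
    "log": {
      "properties": {
        "timestamp": { "type": "date" },
        "url":       { "type": "string", "index": "not_analyzed" },
        "status":    { "type": "integer" }
      }
    }
  }
}'
```

Any field not listed still gets indexed with ES's dynamic-mapping defaults; the explicit mapping just pins down the types and analysis for the fields you care about.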
ES itself doesn't read the data; it is the Elasticsearch Hadoop connector that does this. And yes, it supports the Avro format. Once you have picked your library (Map/Reduce, Hive, Cascading, etc.), simply configure it to read the file just as you typically would in Hadoop and plug the connector in on the other side to fan the data out to Elasticsearch.
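Since you mention Spark SQL, here's a minimal sketch of that route, assuming the Avro data source and the elasticsearch-spark (es-hadoop) package are on the classpath (on older Spark versions the Avro format name is `com.databricks.spark.avro` instead of `avro`). The HDFS path, the `weblogs/log` index/type, and the field names are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._   // adds saveToEs to DataFrames

val spark = SparkSession.builder()
  .appName("weblogs-to-es")
  .config("es.nodes", "localhost:9200")  // your ES node(s)
  .getOrCreate()

// Read the Avro logs straight off HDFS (path is a placeholder)
val logs = spark.read.format("avro").load("hdfs:///data/weblogs")

// Keep only the fields you want indexed, then fan them out to ES
logs.select("timestamp", "url", "status")
    .saveToEs("weblogs/log")   // <index>/<type>
```

Selecting only the columns you need before `saveToEs` is how you limit indexing to particular fields: everything in the DataFrame that reaches the connector gets indexed as a document.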