I am new to Elasticsearch and am looking for guidance on how to do
faceting on our fairly large log collection (1.2 billion syslog records
per month), which we are currently loading into ES. We only need to keep
3 months' worth of logs (a maximum of 6 billion records). My schema for
each syslog line is just a timestamp, a host IP address, and a message
field. But I definitely need to do reporting (ranking) of the busiest
host IPs and the top 20 or even 50 log messages. Understanding that the
term counts being ranked can run into the hundreds of millions (and I
only have a 48 GB RAM server), I have read that there is a way to do this
off heap
(http://www.elasticsearch.org/blog/disk-based-field-data-a-k-a-doc-values/).
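To make the reporting goal concrete, this is roughly the kind of query I
have in mind; just a sketch, where the index name and the field names
host_ip and message are placeholders based on my schema description:

# rank the busiest host IPs and the most frequent log messages
# over one month's index (index/field names are assumptions)
curl -XGET 'localhost:9200/syslog-2014.06/_search' -d '{
  "size": 0,
  "aggs": {
    "busiest_hosts": { "terms": { "field": "host_ip", "size": 20 } },
    "top_messages":  { "terms": { "field": "message", "size": 50 } }
  }
}'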
You could start with a mapping along these lines (see the sketch below)
and then do some testing on a small subset of the data to get a feel for
how much heap your report queries use (check with the _cat/nodes API).
From there, you should be able to determine how many nodes, and how much
RAM per node, you will need.
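A minimal sketch of such a mapping, assuming an ES 1.x index named
syslog-2014.06 with fields timestamp, host_ip, and message (the names are
placeholders); doc_values keeps the field data on disk instead of the
heap, and for string fields it requires index: not_analyzed:

# create the index with doc_values enabled on the fields you aggregate on
curl -XPUT 'localhost:9200/syslog-2014.06' -d '{
  "mappings": {
    "syslog": {
      "properties": {
        "timestamp": { "type": "date",   "doc_values": true },
        "host_ip":   { "type": "string", "index": "not_analyzed", "doc_values": true },
        "message":   { "type": "string", "index": "not_analyzed", "doc_values": true }
      }
    }
  }
}'

While running your report queries against the test subset, you can watch
heap and field data usage with something like:

curl 'localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max,fielddata.memory_size'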