Indices size


(Jérôme Gagnon) #1

Good Morning,

In my quest to switch from Sphinx to ElasticSearch again, I have found that
the size on disk of the indices is about 4x bigger for our ElasticSearch
index compared to our Sphinx files. The actual size I have is 82 GB for
166M documents, or about 2,000 docs/MB. In Sphinx we were able to store
about 8,000 docs/MB. I'm a little worried about I/O usage on my node disks
with files that large. Plus I found this (kind of old) article:
http://zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter/
saying that Lucene indices are smaller than Sphinx ones. Does anyone have
any ideas why ElasticSearch indices would be that much bigger than Sphinx
(and Lucene) ones?
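As a quick sanity check on those numbers (back-of-the-envelope arithmetic only, assuming 1 GB = 1024 MB):

```python
# Back-of-the-envelope check of the index density figures above.
index_size_mb = 82 * 1024          # 82 GB expressed in MB
doc_count = 166_000_000            # 166M documents

print(round(doc_count / index_size_mb))   # ~1977, i.e. the ~2000 docs/MB above

# At Sphinx's reported ~8000 docs/MB, the same corpus would take:
print(round(doc_count / 8000 / 1024))     # ~20 (GB), roughly 4x smaller
```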

I already have
"_all" : {"enabled" : false}
"_source" : {"enabled" : false}
and I'm storing only 4 fields: 3 longs and 1 integer.

The biggest files are .frq and .tis, as this partial listing of my index shows:

453M -rw-r--r-- 1 root root 453M 2012-10-19 11:28 _nc.fdt
52M -rw-r--r-- 1 root root 52M 2012-10-19 11:28 _nc.fdx
4.0K -rw-r--r-- 1 root root 204 2012-10-19 11:28 _nc.fnm
1.8G -rw-r--r-- 1 root root 1.8G 2012-10-19 11:39 _nc.frq
32M -rw-r--r-- 1 root root 32M 2012-10-19 11:39 _nc.nrm
300M -rw-r--r-- 1 root root 300M 2012-10-19 11:39 _nc.prx
820K -rw-r--r-- 1 root root 818K 2012-10-19 16:30 _nc_t4.del
7.3M -rw-r--r-- 1 root root 7.3M 2012-10-19 11:39 _nc.tii
598M -rw-r--r-- 1 root root 598M 2012-10-19 11:39 _nc.tis
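In case it helps with diagnosing where the space goes, here is a rough key to the pre-4.0 Lucene segment file extensions in that listing (my summary of the Lucene 3.x file-format docs, so treat it as a guide rather than gospel):

```python
# Rough meaning of each pre-4.0 Lucene segment file extension seen above
# (summarized from the Lucene 3.x file formats documentation).
lucene_files = {
    ".fdt": "stored field data (the stored field values themselves)",
    ".fdx": "stored field index (pointers into .fdt)",
    ".fnm": "field names/infos",
    ".frq": "term frequencies (which docs contain each term, and how often)",
    ".nrm": "norms (length/boost factors used in scoring)",
    ".prx": "term positions (used by phrase/proximity queries)",
    ".del": "deleted-document bitset",
    ".tii": "term dictionary index (kept in memory)",
    ".tis": "term dictionary (every distinct term in the segment)",
}
for ext, meaning in lucene_files.items():
    print(f"{ext:5} {meaning}")
```

So the two biggest files here are the postings (.frq) and the term dictionary (.tis), which is consistent with lots of distinct terms being indexed.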

--


(Igor Motov) #2

Have you deleted a lot of documents from elasticsearch and reindexed them
again? Could you run

curl -XPOST
'http://localhost:9200/your-index/_optimize?only_expunge_deletes=true'

on your index and see if it reduces the index size?

Which version of elasticsearch are you using?



(Stéphane Raux) #3

Hi,

When you index a numeric field, Lucene actually stores several versions
of the data in order to optimize range queries and sorting:
http://lucene.apache.org/core/old_versioned_docs/versions/2_9_0/api/all/org/apache/lucene/document/NumericField.html

You can disable this behaviour with the 'precision_step' parameter in
your mapping.
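To give a feel for the effect: each numeric value is indexed as roughly ceil(bits / precision_step) trie terms. A sketch of that arithmetic (`terms_per_value` is just an illustrative helper here, not a Lucene API):

```python
import math

def terms_per_value(bits, precision_step):
    """Approximate number of trie terms Lucene indexes per numeric value."""
    return math.ceil(bits / precision_step)

# A long is 64 bits. With the default precision_step of 4,
# each value produces 16 indexed terms:
print(terms_per_value(64, 4))    # 16

# A precision_step >= 64 leaves a single term per value, effectively
# disabling the extra terms (at the cost of slower range queries):
print(terms_per_value(64, 64))   # 1
```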

Hope that helps,

Stéphane


--


(Jérôme Gagnon) #4

Tried the optimize; it did next to nothing for the size... Upgraded to
0.20.RC1, removed frequencies on some fields, and played with
precision_step on low-cardinality fields, but I think there are only
peanuts to win with precision_step.


--


(system) #5