How many fields is too many?

I'm currently working on implementing Elasticsearch on a Django-based REST
API. I hope to be able to search through roughly 5 million documents, but
I've struggled to find an answer to a question I've had from the beginning:
how many fields is too many for a single indexed object?

My setup has 512MB of storage and 4GB of memory, 1 shard, and 2 nodes.

I want to be able to sort/filter on about 30 different fields for that
single model, but only search on 5-6. Is 30 fields too many?
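
To make that concrete, the kind of mapping I have in mind looks roughly like
this -- just a sketch in ES 1.x syntax with made-up field names, where the few
searched fields stay analyzed strings and everything that is only sorted or
filtered on is not_analyzed or a concrete type:

    curl -s -XPUT 'localhost:9200/my_index/_mapping/my_type' -d '{
      "my_type": {
        "properties": {
          "first_name": { "type": "string" },
          "status":     { "type": "string", "index": "not_analyzed" },
          "created_at": { "type": "date" },
          "score":      { "type": "integer" }
        }
      }
    }'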

I have a dev environment set up with roughly 30,000 documents and the same
number of fields, and updates and queries are taking significantly longer
than I had hoped. Updating a single document is taking between 4-5
seconds, and searching for a 5-character long string is taking 3-4 seconds.

Is there hope that this is a configuration problem, or should I reconsider
how many fields I'm using? Thanks in advance.


On Thu, Apr 16, 2015 at 9:40 AM, Mitch Kuchenberg mitch@getambassador.com
wrote:

I'm currently working on implementing Elasticsearch on a Django-based REST
API. I hope to be able to search through roughly 5 million documents, but
I've struggled to find an answer to a question I've had from the beginning:
how many fields is too many for a single indexed object?

My setup has 512MB of storage and 4GB of memory, 1 shard, and 2 nodes.

I want to be able to sort/filter on about 30 different fields for that
single model, but only search on 5-6. Is 30 fields too many?

We run with about 20 fields and have no trouble:
https://en.wikipedia.org/wiki/Field_(mathematics)?action=cirrusDump

We have lots more data and lots more machine than you do but I don't see
why it wouldn't scale down.

I have a dev environment set up with roughly 30,000 documents and the same
number of fields, and updates and queries are taking significantly longer
than I had hoped. Updating a single document is taking between 4-5
seconds, and searching for a 5-character long string is taking 3-4 seconds.

Something is up, yeah. It's hard to figure out what might be up from
reading this though. Some questions that are normal to ask here:

  1. Can you post an example document (like as a gist or pastebin or
    whatever)?
  2. Can you post an example query?
  3. How much heap are you giving Elasticsearch? (For 3 and 4, see the curl
    sketch after this list.)
  4. How much disk are the 30,000 documents taking up (/var/lib/elasticsearch)?
  5. What version are you using?
  6. Do you see I/O during the query (iostat -dmx 3 10)?
  7. Swapping?
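
For 3 and 4, if you can reach the cluster's HTTP endpoint a couple of those
numbers can be read straight off the cat APIs, e.g.:

    # per-node heap ceiling and current usage
    curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.max,heap.percent,ram.percent'

    # per-index doc count and on-disk size
    curl -s 'localhost:9200/_cat/indices?v'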

Nik


The time required for an update depends on the peculiarities of the update
operation, the (often massive) scripting overhead, the refresh operation, and
the related segment merge activity.

The number of fields does not matter.

My application has 5000 fields. I avoid updates at all costs. A new
document is faster.
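
To make the difference concrete, a minimal sketch (index, type, and field
names are invented):

    # indexing a new version of a document is a single write
    curl -s -XPUT 'localhost:9200/my_index/my_type/1' -d '{
      "name": "mitch", "created_at": "2015-04-16"
    }'

    # a partial update makes Elasticsearch fetch the stored source, merge the
    # change (or run a script), and reindex the whole document, on top of the
    # usual refresh and segment merge cost
    curl -s -XPOST 'localhost:9200/my_index/my_type/1/_update' -d '{
      "doc": { "name": "mitchell" }
    }'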

Jörg



On Thu, Apr 16, 2015 at 10:21 AM, joergprante@gmail.com
<joergprante@gmail.com> wrote:

The time required for an update depends on the peculiarities of the update
operation, the (often massive) scripting overhead, the refresh operation, and
the related segment merge activity.

The number of fields does not matter.

My application has 5000 fields. I avoid updates at all costs. A new
document is faster.

We can't do that in our application and so have to eat the load for
updates. As far as I can see the biggest cost is in segment merge and the
overhead of tombstoned entries before they are merged out. Still, with
30,000 documents none of that is likely to be a big deal.


Hey Nik, you'll have to forgive me if any of my answers don't make sense.
I've only been working with Elasticsearch for about a week.

  1. Here's a template for my documents:
    https://gist.github.com/mkuchen/d71de53a80e078242af9
  2. I interact with my search engine through django-haystack
    http://django-haystack.readthedocs.org/en/latest/. A query may look like
    SearchQuerySet().filter(document='mitch').order_by('created_at')[:100] -- so
    essentially getting the first 100 documents that have 'mitch' in them,
    ordered by the field created_at.
  3. Each node has 247.5MB of heap allocated judging by my hosting service's
    dashboard.
  4. The documents/fields take up roughly 30MB on disk.
  5. Using Elasticsearch version 1.4.2 but could very easily upgrade.
  6. I'm hosting with found.no so I don't have access to a command line to
    run that unfortunately.
  7. I haven't found any options in found.no to disable swapping, so I would
    assume they have it off by default? I could be wrong though.

Thanks for your reply.



On Thu, Apr 16, 2015 at 10:54 AM, Mitch Kuchenberg mitch@getambassador.com
wrote:

Hey Nik, you'll have to forgive me if any of my answers don't make sense.
I've only been working with Elasticsearch for about a week.

  1. Here's a template for my documents:
    https://gist.github.com/mkuchen/d71de53a80e078242af9

This is pretty useless to me. You'd need to show me a fully expanded
version in JSON.
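
If it is easier, something like

    curl -s 'localhost:9200/your_index/_search?size=1&pretty'

(with your real index name) prints one stored document in full, which is the
shape I'd want to see.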

  2. I interact with my search engine through django-haystack
    http://django-haystack.readthedocs.org/en/latest/. A query may look like
    SearchQuerySet().filter(document='mitch').order_by('created_at')[:100] -- so
    essentially getting the first 100 documents that have 'mitch' in them,
    ordered by the field created_at.

It's always best when talking about things like this to do it with curl
posting JSON, because curl is our common language.
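
For example, assuming haystack is putting everything into its usual catch-all
text field, the raw request behind that haystack query probably looks
something like this (index and field names are guesses on my part):

    curl -s 'localhost:9200/haystack/_search?pretty' -d '{
      "query": { "match": { "text": "mitch" } },
      "sort": [ { "created_at": { "order": "asc" } } ],
      "size": 100
    }'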

  3. Each node has 247.5MB of heap allocated judging by my hosting
    service's dashboard.

Sorry, what is the value of the -Xmx parameter you used to run Elasticsearch?
The actual amount of heap in use at a given time isn't usually useful.
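
If you can't see the startup flags, the configured ceiling is also reported by
the nodes stats API -- look for jvm.mem.heap_max_in_bytes in:

    curl -s 'localhost:9200/_nodes/stats/jvm?pretty'

(On a self-managed 1.x node that ceiling is whatever you set at startup, e.g.
ES_HEAP_SIZE=2g ./bin/elasticsearch.)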

  4. The documents/fields take up roughly 30MB on disk.

Queries over that much data should be instant.

  5. Using Elasticsearch version 1.4.2 but could very easily upgrade.

It's fine.

  6. I'm hosting with found.no so I don't have access to a command line to
    run that unfortunately.
  7. I haven't found any options in found.no to disable swapping, so I
    would assume they have it off by default? I could be wrong though.

I think you should take this up with found.no. Maybe try running
Elasticsearch locally and comparing. In general, 30MB of index should be
instantly searchable unless Elasticsearch is swapped out, or the Linux disk
cache is cold and the I/O subsystem is amazingly bad. Or something else weird
like that.
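
A quick way to compare is to grab the 1.4.2 tarball from the download page,
run it locally, load a copy of your data, and time the same query against both
endpoints (index name and query below are placeholders):

    tar xzf elasticsearch-1.4.2.tar.gz
    elasticsearch-1.4.2/bin/elasticsearch   # runs in the foreground

    curl -s -o /dev/null -w 'total: %{time_total}s\n' \
      'localhost:9200/haystack/_search' -d '{
        "query": { "match": { "text": "mitch" } }
      }'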

The last question that I forgot to ask was what your mapping is.
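
Something like

    curl -s 'localhost:9200/your_index/_mapping?pretty'

(again, with your real index name or Found endpoint) will dump it.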

Nik
