What is the best way to store (changing) attributes?

I want to store and query different attributes per document. Something like
this:

{
  "name": "doc1",
  "metadata": [
    { "color": "red" },
    { "data": [ "value1", "value2", "value3" ] },
    { "size": 500 },
    { "avail": true }
  ]
},
...
{
  "name": "doc4980",
  "metadata": [
    { "otherValues": [ 55, 33 ] },
    { "important": true }
  ]
}

The metadata array may be different for lots of documents, as its entries
will be defined by the user whenever a new attribute is needed.

Using the attribute name as the field name (the JSON left side) may lead to
high memory usage, so I moved the names to the JSON right side as well. But
I think the following will not work, because of the differing types (int,
string, ...) of the value (v) field:

"_source": {
  "name": "doc4980",
  "metadata": [
    { "k": "otherValues", "v": [ 55, 33 ] },
    { "k": "important", "v": true }
  ]
}

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi,
Your first structure makes more sense, but I would probably make metadata
an object instead of an array.
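
That is, the first document from your example would become:

{
  "name": "doc1",
  "metadata": {
    "color": "red",
    "data": [ "value1", "value2", "value3" ],
    "size": 500,
    "avail": true
  }
}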

Your second approach would work OK if:

  • There is a single way you want to search these values
  • You set up multi-fields on values and analyze each value multiple ways.
    There would likely be hiccups around numbers and dates.
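
As a sketch of that multi-field idea (using the 0.90-era multi_field
mapping syntax; the not_analyzed sub-field name "untouched" is just an
illustration), the k/v mapping could look something like:

{
  "metadata": {
    "type": "nested",
    "properties": {
      "k": { "type": "string", "index": "not_analyzed" },
      "v": {
        "type": "multi_field",
        "fields": {
          "v": { "type": "string" },
          "untouched": { "type": "string", "index": "not_analyzed" }
        }
      }
    }
  }
}

Everything still gets indexed as a string here, which is exactly where the
hiccups around numbers and dates come from.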

It really comes down to the search requirements on that data, though.

Could you elaborate more on the memory issues you are running into? I'm not
sure why these two structures would have memory profiles that differed by
much.

Thanks,
Paul

On Monday, September 9, 2013 7:55:46 AM UTC-4, joa wrote:


Hi, I rejected the first structure due to this comment on one of my other
questions: https://groups.google.com/d/msg/elasticsearch/pUg9GbDOMf8/QlKPkftm3e4J.
What is the advantage of using an object instead of an array as you
suggested? Thanks!


Using an object is simply a more straightforward abstraction; in practice
there is likely no difference.

From that thread, I see there are two distinctions:

  • Lots of document types leading to higher memory. I personally don't use
    lots of types, so I can't really comment one way or the other.
  • Lots of document fields leading to a large number of field names in the
    index. I've had no issue with mappings of ~50 fields, and I'd be
    surprised if this really became a pain point. This is really just config
    that gets applied during search/indexing, and each distinct field gets
    its own inverted index (that is likely where most of the overhead comes
    from). Have you tested around this and had issues?

How many distinct metadata names (keys) do you expect to have? There is no
clear-cut number where things start to have issues, and I'm sure it depends
on the amount of resources in the cluster. I wouldn't be concerned with
fewer than 100 fields; beyond that, I'd really test to see where things
start to take a nosedive.

There are definitely trade-offs with both approaches, but the nested k/v
approach ends up being more complicated, and it's the route I would take
only after evaluating the other possibilities and finding that they fail.

Thanks,
Paul

On Monday, September 9, 2013 5:35:31 PM UTC-4, joa wrote:


The number of fields could be 1,500 (!) or higher. (That comes from, e.g.,
100 projects with 15 metadata entries per project.) I am trying to find a
robust and scalable structure that will scale out well later on.

I think the biggest problem with the k/v approach is that the values can
only be saved as strings, so I cannot run queries like greater-than or
between. On the other hand, having 1,500 (and more) different fields feels
wrong?


If it were me, I would test each setup to make sure you end up running with
the right approach.

To handle your concern with the k/v approach, I would make the value an
object and give it string, date, and numeric attributes with the
appropriate analyzers. When you index your data, you will need to look at
each value and set the fan-out fields accordingly, e.g.:

{
  "k": "NumericData",
  "v": { "number": 1234 }
}

{
  "k": "StringData",
  "v": { "string": "Some user text" }
}
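
Under that approach, a mapping for the typed value object might look
something like this (nested type; the sub-field names "string", "number",
and "date" are illustrative, and you would attach analyzers as needed):

{
  "mappings": {
    "doc": {
      "properties": {
        "name": { "type": "string" },
        "metadata": {
          "type": "nested",
          "properties": {
            "k": { "type": "string", "index": "not_analyzed" },
            "v": {
              "properties": {
                "string": { "type": "string" },
                "number": { "type": "double" },
                "date": { "type": "date" }
              }
            }
          }
        }
      }
    }
  }
}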

You'll need some smarts on the query side to span these fields or to target
specific ones (dates/numbers). Don't forget to index and query the k/v
setup using nested documents.
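
For example, a greater-than query against a numeric attribute (the key
"size" here just mirrors the earlier example) would go through a nested
query along these lines:

{
  "query": {
    "nested": {
      "path": "metadata",
      "query": {
        "bool": {
          "must": [
            { "term": { "metadata.k": "size" } },
            { "range": { "metadata.v.number": { "gt": 100 } } }
          ]
        }
      }
    }
  }
}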

Hope this helps.

Best Regards,

Paul

On Monday, September 9, 2013 6:15:12 PM UTC-4, joa wrote:


Thanks for your help! I've already started testing both variants with dummy
data. Apart from how the complexity differs between the variants, how can I
test which one is the most memory-friendly?

Do I need to compare the different results of
http://localhost:9200/_stats?pretty
and http://localhost:9200/_nodes?all=true&pretty?
