How to design ES for user-defined custom fields

Let's say I have an index called "Contacts"

Each contact belongs to a user and has custom variables like this:

{
  user_id: 684498,
  name: "John Doe",
  custom_age: 27,
  custom_hobbies: ['painting'],
  custom_source: 'flyers'
}

Users should be able to drill down by custom variables. Each user can have from a few contacts to 1M contacts.

There is potentially an unlimited number of custom variables per contact, and each user uses different ones.

I was thinking of increasing index.mapping.total_fields.limit, which is 1000 by default, but that will never be enough: if you sum up the distinct custom fields across all users, it would amount to hundreds of thousands, if not millions.

I was also thinking of creating an index per user, but I heard there are performance concerns.

How can we do this right?

If all users share the same index as you proposed, imagine the following situation.

User A indexes:

{
  user_id: 684498,
  name: "John Doe",
  custom_foo: "2018-08-01"
}

Then user B indexes:

{
  user_id: 684498,
  name: "John Doe",
  custom_foo: "Hello world"
}

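As you can imagine, that is not going to work: with the default dynamic mapping, custom_foo is mapped as a date when user A's document arrives, so user B's "Hello world" can no longer be indexed into that field.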

You could think also of the following structure:

{
  user_id: 684498,
  name: "John Doe",
  custom: [ {
    "name": "foo",
    "value": "Hello world"
  } ]
}

But this will have some consequences:

  • You must use the nested type, which comes with some drawbacks
  • You can't have real types for your fields, as every value will be indexed as text

If you know which data types your users need, you can think of something like:
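As a rough illustration of the name/value structure, here is a minimal Python sketch of a mapping and a query body for it. The names CONTACTS_MAPPING and custom_field_query are hypothetical, the mapping assumes the typeless format of Elasticsearch 7 and later, and indexing value as keyword (for exact matches) rather than text is my assumption. The dicts would be passed to whatever client you use.

```python
# Hypothetical mapping: "custom" is a nested array of name/value pairs.
# Assumption: "value" is a keyword so it can be filtered on exactly.
CONTACTS_MAPPING = {
    "mappings": {
        "properties": {
            "user_id": {"type": "long"},
            "name": {"type": "text"},
            "custom": {
                "type": "nested",
                "properties": {
                    "name": {"type": "keyword"},
                    "value": {"type": "keyword"},
                },
            },
        }
    }
}


def custom_field_query(user_id, field_name, field_value):
    """Build a query body matching one user's contacts on one custom field."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"user_id": user_id}},
                    {
                        # The nested query keeps name and value tied to the
                        # same array element instead of matching across items.
                        "nested": {
                            "path": "custom",
                            "query": {
                                "bool": {
                                    "filter": [
                                        {"term": {"custom.name": field_name}},
                                        {"term": {"custom.value": field_value}},
                                    ]
                                }
                            },
                        }
                    },
                ]
            }
        }
    }
```

For example, custom_field_query(684498, "source", "flyers") would drill down to that user's contacts whose source is "flyers". Only three mapped fields exist under custom no matter how many variables users define.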

{
  user_id: 684498,
  name: "John Doe",
  custom_text: [ {
    "name": "foo",
    "value": "Hello world"
  } ],
  custom_date: [ {
    "name": "foo",
    "value": "2018-08-02"
  } ]
}
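To make the per-type layout concrete, here is a small Python sketch of how an application might route each custom value into the matching array before indexing. The function name split_custom_fields and the extra custom_number and custom_boolean buckets are assumptions of mine, not something from the structure above.

```python
from datetime import date


def split_custom_fields(custom):
    """Group a flat dict of custom fields into per-type name/value arrays."""
    doc = {}
    for name, value in custom.items():
        if isinstance(value, date):
            # Dates are stored as ISO 8601 strings, e.g. "2018-08-02".
            bucket, stored = "custom_date", value.isoformat()
        elif isinstance(value, bool):
            # Checked before numbers because bool is a subclass of int.
            bucket, stored = "custom_boolean", value
        elif isinstance(value, (int, float)):
            bucket, stored = "custom_number", value
        else:
            bucket, stored = "custom_text", str(value)
        doc.setdefault(bucket, []).append({"name": name, "value": stored})
    return doc
```

Each bucket would then be mapped as its own nested field with a properly typed value, so range queries on dates or numbers become possible.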

Or like this:

{
  user_id: 684498,
  name: "John Doe",
  custom: {
    dt_foo: "2018-08-01",
    st_foo: "Hello world"
  }
}

Where dt and st are prefixes for dates and strings, which can be used by dynamic templates.

If you can't know what the users will enter, leave that entirely to them, and need to index this data (otherwise you can just ask Elasticsearch to ignore anything under custom), then I believe the best choice is to have one index per user, with the drawbacks you are already aware of.

Not an easy decision to make, but I hope this gives you some clues.
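A minimal sketch of what such dynamic templates could look like, written as a Python dict to send when creating the index. The template names custom_dates and custom_strings, the helper prefixed, and the choice of keyword for strings are assumptions; path_match is the dynamic-template option that matches on the full dotted field path.

```python
# Hypothetical index body: any new field under "custom" is typed by prefix.
CUSTOM_DYNAMIC_TEMPLATES = {
    "mappings": {
        "dynamic_templates": [
            {
                "custom_dates": {
                    "path_match": "custom.dt_*",
                    "mapping": {"type": "date"},
                }
            },
            {
                "custom_strings": {
                    "path_match": "custom.st_*",
                    # Assumption: exact-match strings; use "text" for full-text search.
                    "mapping": {"type": "keyword"},
                }
            },
        ]
    }
}

# Prefix table used by the application when it writes documents.
PREFIXES = {"date": "dt_", "string": "st_"}


def prefixed(kind, name):
    """Return the field name to store under "custom" for a given type."""
    return PREFIXES[kind] + name
```

Note that every distinct prefixed name still becomes its own mapped field, so this approach fixes the type conflict but still counts against index.mapping.total_fields.limit.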

Thanks @dadoonet - this is helpful.

What about the index-per-user method? Is it practical when you have an abundance of users (100k-1M users)?

If it's not, can Elastic Cloud handle that number of indexes, even though a lot of them will hold little data?

No. I believe this strategy might require many more nodes.

But maybe @jpountz can share his thoughts on a Lucene level.

Create field_0_int, field_0_date, field_0_string, etc., and allow each user, say, 500 custom fields of each type. Then create another table/index that maps each user's custom field names to those fields (or just keep that mapping in a field of the current users table), and it's done.

I don't think the Elastic service has some magic sauce code to support a billion indexes.
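Here is a rough Python sketch of that slot idea, with the name mapping held in memory. The class FieldSlotRegistry and its method are hypothetical; in practice the mapping would live in the users table or a separate index, as described above.

```python
MAX_SLOTS_PER_TYPE = 500
SUPPORTED_TYPES = ("int", "date", "string")


class FieldSlotRegistry:
    """Maps each user's custom field names to shared, pre-typed index fields."""

    def __init__(self):
        # {user_id: {custom_name: physical_field_name}}
        self._slots = {}

    def physical_field(self, user_id, custom_name, field_type):
        """Return the index field for a custom field, allocating one if new."""
        if field_type not in SUPPORTED_TYPES:
            raise ValueError("unsupported type: " + field_type)
        user_slots = self._slots.setdefault(user_id, {})
        if custom_name in user_slots:
            # Already allocated: the first declared type wins.
            return user_slots[custom_name]
        suffix = "_" + field_type
        used = sum(1 for f in user_slots.values() if f.endswith(suffix))
        if used >= MAX_SLOTS_PER_TYPE:
            raise RuntimeError("no free " + field_type + " slots for this user")
        field = "field_" + str(used) + suffix
        user_slots[custom_name] = field
        return field
```

Two users can both end up on field_0_int for unrelated variables, which is fine because every query is also filtered by user_id. The total number of mapped fields stays fixed at 500 per type regardless of how many users there are.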

I see, thanks for the responsiveness!

What is the rough overhead of an empty index with around 10 mapped fields? I read in other forums that the overhead grows linearly with the number of indexes.

I'm just trying to get a rough idea of the cost/scalability if we went with that approach.
