Storing table-like data in Elasticsearch

Hi!

I'm an Elasticsearch newbie approaching the following project: I have
many millions (100+ and counting) of tables in JSON format like this:

{ "title" : "dogs species",
  "col_names" : [ "name", "description", "country_of_origin" ],
  "rows" : [ [ "Boxer", "good dog", "Germany" ],
             [ "Irish Setter", "great dog", "Irland" ] ]
}

{ "title" : "Classmates",
  "col_names" : [ "name", "class", "age", "avg_grade" ],
  "rows" : [ [ "Alice", "A", "14", "85" ],
             [ "Bob", "B", "15", "91" ] ]
}

{ "title" : "Misc stuff",
  "col_names" : [ "foo" ],
  "rows" : [ [ "Setter is important" ],
             [ "Irland is green" ] ]
}

I.e. tons of completely unrelated structured data. My goal is to make it
searchable.
My search requirements are:

  • Just search for text inside the cells across all tables. I.e. searching
    for "Boxer Irland" should find the first and last tables above.
  • Matches within the same row should get a higher score. I.e. searching
    for "Irland Setter" should give the first table the higher score in the results.
  • It's also important to somehow preserve the data structure of each
    table, so it can be fetched and converted back to structured JSON.

Any ideas on how to approach this problem?

Thank you all very much in advance.

Zaar

--

Hi,

If you can transform your JSON just a little bit, you will be able to throw
it at ES almost as is. Have a look at the JSON examples in, say, the index API
documentation (http://www.elasticsearch.org/guide/reference/api/index_.html)
to see what I mean.

Otis

Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html


--

Hi Otis,

Are you sure you gave me the right link?

Can you please be more specific about how to transform it? Pull all of
the cell data into a single field? But then I'd lose the table structure...

Zaar.


--

For a CSV table like this one:

key1, key2, key3
val1, val2, val3
val11, val12, val13

this csv should be transformed into

[{
  "key1": "val1",
  "key2": "val2",
  "key3": "val3"
},{
  "key1": "val11",
  "key2": "val12",
  "key3": "val13"
}]

to be sent to Elasticsearch.
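In Python that transformation is a few lines; a minimal sketch (the `csv_to_docs` helper name is my own, not anything from ES):

```python
import csv
import io
import json

def csv_to_docs(csv_text):
    """Turn a CSV table (header row + data rows) into a list of dicts,
    one per row, keyed by the header names -- the shape that can be
    sent to Elasticsearch as plain JSON documents."""
    reader = csv.reader(io.StringIO(csv_text.strip()))
    header = [h.strip() for h in next(reader)]
    return [dict(zip(header, (v.strip() for v in row))) for row in reader]

table = """\
key1, key2, key3
val1, val2, val3
val11, val12, val13"""

print(json.dumps(csv_to_docs(table), indent=2))
```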


--


I agree that the above would be ideal, but Zaar has explained that his
data has arbitrary column names, so he may end up with massive mappings.

Zaar: your documents would be indexed like this:

{ "title" : [dogs, species],
"col_names" : [ name, description, country_of_origin ],
"rows": [ boxer, good, dog, germany, irish, setter, great, dog, irland]
}

so you can search the "rows" field for any of those terms, and you can
use a match_phrase query with a highish "slop" value to boost terms that
are closer together, eg "boxer good" would score higher than "boxer dog"
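A sketch of that flattening in Python (the `flatten_table` name is my own; Elasticsearch's analyzer would do the actual lowercasing and tokenizing at index time):

```python
def flatten_table(table):
    """Collapse every cell of a table document into one "rows" string,
    so all cell text lands in a single searchable field."""
    return {
        "title": table["title"],
        "col_names": " ".join(table["col_names"]),
        "rows": " ".join(cell for row in table["rows"] for cell in row),
    }

doc = flatten_table({
    "title": "dogs species",
    "col_names": ["name", "description", "country_of_origin"],
    "rows": [["Boxer", "good dog", "Germany"],
             ["Irish Setter", "great dog", "Irland"]],
})
print(doc["rows"])  # Boxer good dog Germany Irish Setter great dog Irland
```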

clint

--

Clinton,
so if I understand you correctly, you suggest transforming my first example
to the following:

{
  "title": "dogs species",
  "col_names": "[ \"name\", \"description\", \"country_of_origin\" ]",
  "rows": "[ [ \"Boxer\", \"good dog\", \"Germany\" ], [ \"Irish Setter\", \"great dog\", \"Irland\" ] ]"
}

That actually makes sense! I've thrown it in, and the default analyzer does a
good job of harvesting just the words:

curl -s -X GET "http://localhost:9200/test2/_analyze?pretty=true" -d '[ [
"Boxer", "good dog", "Germany" ], [ "Irish Setter", "great dog",
"Irland" ] ]' | grep token
"tokens" : [ {
"token" : "boxer",
"token" : "good",
"token" : "dog",
"token" : "germany",
"token" : "irish",
"token" : "setter",
"token" : "great",
"token" : "dog",
"token" : "irland",

It also stores valid JSON lists as the values of the "rows" and "col_names"
fields, so I can easily reconstruct structured data from the query results.
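That round trip can be sketched with the standard json module (no ES involved; index time serializes, fetch time parses back):

```python
import json

rows = [["Boxer", "good dog", "Germany"],
        ["Irish Setter", "great dog", "Irland"]]

# Index time: serialize the row structure into one string field.
# The standard analyzer tokenizes it into lowercase words and
# ignores the brackets, quotes and commas.
stored = json.dumps(rows)

# Fetch time: the stored string parses straight back into the table.
restored = json.loads(stored)
assert restored == rows
```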

Thank you!
Zaar


--

Actually I've spotted that Elasticsearch supports arrays (though I haven't
tested nested arrays yet). So I've come up with the model below.

However, I'm now struggling with how to give priority to matches from the
same row. I.e. currently a text search for "Irland Setter" gives the second
document a much higher score (0.21 vs 0.13 respectively). I need the first
document to have a higher score because it has both "Irland" and "Setter" in
the same row.

{ "title": "dogs species",
  "col_names": [ "name", "description", "country_of_origin" ],
  "rows": [
    { "row": [ "Boxer", "good dog", "Germany" ] },
    { "row": [ "Irish Setter", "great dog", "Irland" ] }
  ]
}
{ "title": "Misc stuff",
  "col_names": [ "foo" ],
  "rows": [
    { "row": [ "Setter is important" ] },
    { "row": [ "Irland is green" ] }
  ]
}

Zaar


--

However I'm now struggling with how to give priority to matches from
the same row. I.e. currently a text search for "Irland Setter" gives the
second document a much higher score (0.21 and 0.13 respectively).

First, you're experimenting with very few documents (I assume) which
means that your terms are unevenly distributed across your shards. For
testing purposes, I would either add "search_type=dfs_query_then_fetch"
to your search query string, or I would create a test index with only 1
shard.

I need the first document to have a higher score because it has both
"Irland" and "Setter" in the same row.

Use the match_phrase query with a high "slop" value, eg:

{ "query": {
    "match_phrase": {
        "row": { 
             "query": "irland setter",
             "slop":  100
         }
    }
}}

This will incorporate token distance into the relevance calculation.

Also, when you're indexing arrays of analyzed strings, it may be worth
setting the position_offset_gap in the mapping.

If you index ["quick brown", "fox"], by default it would be indexed as:

  • position 1 : quick
  • position 2 : brown
  • position 3 : fox

If you set the position_offset_gap, i.e. map the "row" field as:

{ type: "string", position_offset_gap: 100 }

it would be indexed as:

  • position 1 : quick
  • position 2 : brown
  • position 103 : fox

This of course depends on what you are trying to achieve with your data.
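The effect of the gap is easy to simulate (positions numbered from 1 to match the example above; `index_positions` is a made-up helper, not an ES API):

```python
def index_positions(values, gap=0):
    """Assign term positions to whitespace tokens across an array of
    string values, adding `gap` extra positions between consecutive
    array entries -- what position_offset_gap controls in the mapping."""
    pos, out = 1, []
    for i, value in enumerate(values):
        if i:
            pos += gap
        for token in value.lower().split():
            out.append((pos, token))
            pos += 1
    return out

print(index_positions(["quick brown", "fox"]))
# [(1, 'quick'), (2, 'brown'), (3, 'fox')]
print(index_positions(["quick brown", "fox"], gap=100))
# [(1, 'quick'), (2, 'brown'), (103, 'fox')]
```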

clint

--


On Saturday, November 17, 2012 2:24:02 PM UTC+2, Clinton Gormley wrote:

First, you're experimenting with very few documents (I assume) which
means that your terms are unevenly distributed across your shards. For
testing purposes, I would either add "search_type=dfs_query_then_fetch"
to your search query string, or I would create a test index with only 1
shard.

Yes, I'm currently experimenting with just two documents to make sure I'm
on the right track. I've recreated them on a single shard following your
advice.

I need the first document to have a higher score because it has both
"Irland" and "Setter" in the same row.

Use the match_phrase query with a high "slop" value, eg:

{ "query": { 
    "match_phrase": { 
        "row": { 
             "query": "irland setter", 
             "slop":  100 
         } 
    } 
}} 

This does not help. The "wrong" (second) document still gets a much higher
score.
I think it's because after analysis, the first document looks like:
"boxer", "good", "dog", "germany", "irish", "setter", "great", "dog",
"irland"
and the second:
"setter", "important", "irland", "green"

So in the second document "setter" is actually closer to "irland" than
in the first one.
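That intuition is easy to check over the analyzed token lists (`min_distance` is my own crude stand-in for the positional distance a phrase scorer considers):

```python
def min_distance(tokens, a, b):
    """Smallest positional distance between two terms in a token list."""
    pa = [i for i, t in enumerate(tokens) if t == a]
    pb = [i for i, t in enumerate(tokens) if t == b]
    return min(abs(i - j) for i in pa for j in pb)

doc1 = ["boxer", "good", "dog", "germany",
        "irish", "setter", "great", "dog", "irland"]
doc2 = ["setter", "important", "irland", "green"]

print(min_distance(doc1, "setter", "irland"))  # 3
print(min_distance(doc2, "setter", "irland"))  # 2 -- closer, so doc2 wins
```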

Also, when you're indexing arrays of analyzed strings, it may be worth
setting the position_offset_gap in the mapping.

If you index ["quick brown", "fox"], by default it would be indexed as:

  • position 1 : quick
  • position 2 : brown
  • position 3 : fox

If you set the position_offset_gap, i.e. map the "row" field as:

{ type: "string", position_offset_gap: 100 }

it would be indexed as:

  • position 1 : quick
  • position 2 : brown
  • position 103 : fox

This of course depends on what you are trying to achieve with your data.

This looks like an interesting approach. However, I need gaps between rows,
not between row members.
Strangely enough, changing the mapping for "row" as you've suggested caused
no results at all.
Also, running the analyzer shows that position_offset_gap is disregarded
completely.

Here is my query:
{
  "query": {
    "match_phrase": {
      "row": {
        "query": "setter ireland", "slop": 100
      }
    }
  }
}

And here is my mapping:
{
  "table": {
    "properties": {
      "title": { "type": "string" },
      "col_names": { "type": "string" },
      "rows": {
        "properties": {
          "row": { "type": "string", "position_offset_gap": 100 }
        }
      }
    }
  }
}

Thank you very much for your help and time!
Zaar


--

This does not help. The "wrong" (second) document still gets a much
higher score.
I think it's because after analysis, the first document looks like:
"boxer", "good", "dog", "germany", "irish", "setter", "great",
"dog", "irland"
and the second:
"setter", "important", "irland", "green"

Ah right, yes. And probably the fact that that row is shorter makes it
appear to be more relevant. You could try setting omit_norms to true,
to ignore field-length normalization.

This looks like an interesting approach. However I need gaps between
rows and not between row members.

True, sorry!

You may want to try an approach where you make "rows" type "nested". That
would store each "row" as a separate sub-document, which you can query
individually.

Then you can also add { "include_in_root": true } to the "rows" mapping, so
that all the data is also indexed in the root document.
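A sketch of that mapping, built as a Python dict and printed as the JSON body you would send when creating the index (ES 0.x-era field syntax, matching the rest of the thread):

```python
import json

# "rows" becomes a nested type, so each row is matched as its own
# sub-document; include_in_root additionally flattens the row terms
# into the parent (root) document.
mapping = {
    "table": {
        "properties": {
            "title": {"type": "string"},
            "col_names": {"type": "string"},
            "rows": {
                "type": "nested",
                "include_in_root": True,
                "properties": {
                    "row": {"type": "string"},
                },
            },
        }
    }
}

print(json.dumps(mapping, indent=2))
```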

I've put together a demo here:

clint

--

On Saturday, November 17, 2012 5:11:50 PM UTC+2, Clinton Gormley wrote:

This looks like an interesting approach. However I need gaps between
rows and not between row members.

True, sorry!

you may want to try an approach where you make "rows" type "nested". So
that would store each "row" as a separate sub-document, which you could
query individually.

Then you can also add {include_in_root: true} to the "rows" mapping, so
that all the data would also be indexed in the root document.

I've put together a demo here:

Nested documents · GitHub

Wow! It works! Thank you very much!

Just a couple of follow-up questions:

  1. Would "include_in_root" cause the rows data to be effectively stored and
     indexed twice?

  2. I've tried disabling "include_in_root" - it makes the second document
     not appear in the results. This is because we are searching for a phrase
     and there is no single row that contains both "irland" and "setter", and
     since I've disabled storing all of the rows as one document in the root,
     it basically filters out the second document. Do I understand it right?

  3. Your query consists of two parts - the "nested" one and the second one.
     The first searches through each and every row as its own document, and
     the second searches through all of the rows as a whole for each table,
     right?

Thank you again for helping me with my first steps into Elasticsearch.
That really made a difference!

Zaar


--

Hiya

Just a couple of follow-up questions:

  1. Would "include_in_root" cause the rows data to be effectively stored
     and indexed twice?

Indexed twice, yes; not stored twice. But given that this is indexed as
an inverted index, it doesn't add much to the index size. I ran a
small test and include_in_root made the index 3% bigger.

  2. I've tried disabling "include_in_root" - it makes the second
     document not appear in the results. This is because we are searching
     for a phrase and there is no single row that contains both "irland" and
     "setter", and since I've disabled storing all of the rows as one
     document in the root, it basically filters out the second document. Do I
     understand it right?

Correct

  3. Your query consists of two parts - the "nested" one and the second one.
     The first searches through each and every row as its own document,
     and the second searches through all of the rows as a whole for
     each table, right?

yes, because when the terms are stored in the root object, they are
flattened into the 'rows.row' field, instead of being stored as separate
docs (as in the nested version)

Thank you again for helping me with the first steps into the
Elasticsearch. That really made a difference!

glad to hear it :)

clint

--

On 11/16/2012 7:03 AM, Clinton Gormley wrote:

Wow, I never knew how an array would score for distance! I had guessed
each one-term array entry would be at position 0, just like indexing two
synonyms at the same position in a stream of analyzed terms in Lucene.
Where in Lucene or Elasticsearch is it stated otherwise, so that a
phrase query would behave as you say? Or are you possibly thinking only of
the case where all these words end up in the "_all" field?

-Paul

--


No, not talking about the _all field.

here's a demo:

clint

--

On 11/27/2012 12:46 AM, Clinton Gormley wrote:

No, not talking about the _all field.
here's a demo:

Array positions in Elasticsearch · GitHub

clint

Nice! That is good to see.
I'm sure I will forget it by the time I need to use it, because I don't see
any mention of position_offset_gap in the documentation or the Java docs,
so I never knew to even consider it.

Maybe on the page.

The docs would need to include the default value (0?) and some
understanding of how indexed array entries (0, 1, 2, ...) relate to
analyzing a string and the Lucene offset and position information
(optionally) stored about each token in a field. Apparently this attribute
position_offset_gap refers to the word position of the term. I think in
Lucene, to avoid overloading the term "offset", they used the term
"increment" (yikes, now I'm using multiple meanings of the term "term").
I also see that one of the original requests referred to the Lucene
positionIncrementGap.

Something to the effect of:

When storing offsets (see the term_vector mapping documentation),
each instance in the array of values for a field is stored with a word
position ? (0?). You can define the difference in position between
each member of an array using:
position_offset_gap
[insert clint's example here]
or Igor's recent example from the thread
http://elasticsearch-users.115913.n3.nabble.com/Search-in-the-same-phrase-td4025826.html

If the field is analyzed into multiple terms, the position gap runs from
the position of the last term of the previous instance of the field to
the first term of the next instance.

But the above doesn't feel clear enough.

-Paul

--