Need advice on how to organize data / schema

hrvoj3e · May 29, 2020, 1:27pm

I need to index book texts (mostly extracted from page images via OCR) and run full-text search over those pages.

TL;DR: Should I index

each part as a separate doc,
or is it better to index records as docs and have a dynamic field for pages?

Records are books, and parts are pages that contain text.

Record A
   page1
   page2
   ...
   pageN
Record B
   page1
   page2
   ...
   pageN

I need to search the text field (value), aggregate by record_id, and display the best N hits sorted by score, descending.

My model so far is to index each page as a separate document and then aggregate by record_id.

Current mapping

"mappings": {
  "properties": {
    "part_id": {
      "type": "integer"
    },
    "record_id": {
      "type": "integer"
    },
    "ri_published": {
      "type": "boolean"
    },
    "rp_visible": {
      "type": "boolean"
    },
    "value": {
      "type": "text"
    }
  }
}
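To make the one-document-per-page model concrete, here is a minimal sketch of flattening a record into page documents that fit the mapping above and serializing them for the `_bulk` API. The input tuple shape, the index name `book_pages`, and the use of `part_id` as `_id` (which assumes `part_id` is unique across all records) are assumptions for illustration, not part of the original setup.

```python
import json


def pages_to_docs(record_id, pages, published=True):
    """Flatten one record (book) into one document per page.

    `pages` is a list of (part_id, text, visible) tuples; this input
    shape is an assumption made for the example.
    """
    return [
        {
            "part_id": part_id,
            "record_id": record_id,
            "ri_published": published,
            "rp_visible": visible,
            "value": text,
        }
        for part_id, text, visible in pages
    ]


def to_bulk_ndjson(index, docs):
    """Serialize docs as _bulk NDJSON: one action line, then one source line."""
    lines = []
    for doc in docs:
        # Assumes part_id is globally unique, so it can double as the _id.
        action = {"index": {"_index": index, "_id": str(doc["part_id"])}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


docs = pages_to_docs(
    1,
    [(101, "john wayne rides again", True), (102, "blank page", False)],
)
bulk_body = to_bulk_ndjson("book_pages", docs)
print(bulk_body)
```

The resulting string would be sent as the body of a `POST /_bulk` request with the `application/x-ndjson` content type.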

Requirements

sort results by relevance/score, descending (phrase, Lucene dismax, ...)
exclude parts (part_id) that are not visible (rp_visible=false)
paginate results
(later) filter by some metadata (not yet in the index), but only for some searches
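One way the visibility and metadata requirements could be expressed is as `filter` clauses next to the scoring `match_phrase` clause, since clauses in a `bool` filter context restrict the result set without contributing to `_score`. This is only a sketch: the function name `build_page_query` and the metadata field `language` are hypothetical, and the metadata fields are not in the index yet.

```python
def build_page_query(phrase, visible_only=True, metadata_filters=None):
    """Build the `query` part of a search body for page documents.

    The phrase match goes in `must` (it is scored); visibility and
    optional metadata constraints go in `filter` (not scored).
    """
    filters = []
    if visible_only:
        filters.append({"term": {"rp_visible": True}})
    # Hypothetical metadata fields; exact-value matching is assumed.
    for field, value in (metadata_filters or {}).items():
        filters.append({"term": {field: value}})

    bool_query = {"must": [{"match_phrase": {"value": {"query": phrase}}}]}
    if filters:
        bool_query["filter"] = filters
    return {"bool": bool_query}


query = build_page_query("john wayne", metadata_filters={"language": "en"})
print(query)
```

The returned dict would replace the `query` object of a search request; the aggregations can stay unchanged because they only see documents that pass the query.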

Solutions I have used so far

Query 1 (older)

{
  "_source": {
    "excludes": [
      "value"
    ]
  },
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase": {
            "value": {
              "query": "john wayne"
            }
          }
        }
      ]
    }
  },
  "size": 0,
  "aggs": {
    "by_record": {
      "terms": {
        "field": "record_id",
        "order": {
          "by_score_max": "desc"
        }
      },
      "aggs": {
        "by_top_records": {
          "top_hits": {
            "size": 3,
            "highlight": {
              "pre_tags": [
                "[ii]"
              ],
              "post_tags": [
                "[/ii]"
              ],
              "fields": {
                "value": {
                  "number_of_fragments": 5,
                  "fragment_size": 100,
                  "type": "unified",
                  "order": "none",
                  "no_match_size": 100
                }
              }
            },
            "_source": {
              "excludes": [
                "value"
              ]
            }
          }
        },
        "by_score_max": {
          "max": {
            "script": {
              "source": "_score"
            }
          }
        }
      }
    }
  }
}
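For reference, the response to Query 1 nests the per-record results under `aggregations.by_record.buckets`. The sketch below flattens that structure into one row per record. The helper name `summarize_buckets` and the sample response are made up for illustration (hand-written values, not real output), and only the fields the helper reads are included.

```python
def summarize_buckets(response, agg="by_record", top="by_top_records",
                      score="by_score_max"):
    """Turn the terms/top_hits aggregation response into one row per record."""
    rows = []
    for bucket in response["aggregations"][agg]["buckets"]:
        hits = bucket[top]["hits"]["hits"]
        rows.append({
            "record_id": bucket["key"],
            "max_score": bucket[score]["value"],
            "pages": [
                {
                    "part_id": hit["_source"]["part_id"],
                    "score": hit["_score"],
                    # Highlight fragments are keyed by field name.
                    "fragments": hit.get("highlight", {}).get("value", []),
                }
                for hit in hits
            ],
        })
    return rows


# Hand-written sample in the shape of a Query 1 response (not real data).
sample_response = {
    "aggregations": {
        "by_record": {
            "buckets": [
                {
                    "key": 7,
                    "doc_count": 2,
                    "by_score_max": {"value": 3.5},
                    "by_top_records": {
                        "hits": {
                            "hits": [
                                {
                                    "_score": 3.5,
                                    "_source": {"part_id": 101, "record_id": 7},
                                    "highlight": {
                                        "value": ["[ii]john wayne[/ii] rides"]
                                    },
                                }
                            ]
                        }
                    },
                }
            ]
        }
    }
}

rows = summarize_buckets(sample_response)
print(rows)
```

The aggregation names are parameters, so the same helper could read the Query 2 response by passing `top="by_record_top"` and `score="by_record_max"`; note that composite buckets report `key` as an object such as `{"record_id": 7}` rather than a bare value.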

Query 2 (composite), currently in use

{
  "_source": {
    "excludes": [
      "value"
    ]
  },
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase": {
            "value": {
              "query": "john wayne"
            }
          }
        }
      ]
    }
  },
  "size": 0,
  "aggs": {
    "by_record": {
      "composite": {
        "size": 200,
        "sources": {
          "record_id": {
            "terms": {
              "field": "record_id"
            }
          }
        }
      },
      "aggs": {
        "by_record_top": {
          "top_hits": {
            "size": 3,
            "highlight": {
              "pre_tags": [
                "[ii]"
              ],
              "post_tags": [
                "[/ii]"
              ],
              "fields": {
                "value": {
                  "number_of_fragments": 2,
                  "fragment_size": 100,
                  "type": "unified",
                  "order": "none",
                  "no_match_size": 200
                }
              }
            },
            "_source": {
              "excludes": [
                "value"
              ]
            }
          }
        },
        "by_record_max": {
          "max": {
            "script": {
              "source": "_score"
            }
          }
        },
        "agg_bucket_limit": {
          "bucket_sort": {
            "from": 0,
            "size": 20,
            "sort": {
              "by_record_max": {
                "order": "desc"
              }
            }
          }
        }
      }
    }
  }
}
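Paging in Query 2 comes down to the `from` and `size` values of the `bucket_sort` aggregation. A minimal sketch of computing them per page follows; the helper name `bucket_sort_page` and the zero-based page convention are assumptions, and the `sort` object mirrors the form used in the query above.

```python
def bucket_sort_page(page, page_size=20, sort_by="by_record_max"):
    """Build the bucket_sort aggregation for a zero-based result page."""
    if page < 0 or page_size < 1:
        raise ValueError("page must be >= 0 and page_size >= 1")
    return {
        "bucket_sort": {
            "from": page * page_size,
            "size": page_size,
            "sort": {sort_by: {"order": "desc"}},
        }
    }


first_page = bucket_sort_page(0)
third_page = bucket_sort_page(2)
print(third_page)
```

One caveat worth checking against the documentation: as far as I understand, `bucket_sort` only sorts the buckets that the parent aggregation returns for that request. With a composite `size` of 200, the ranking would then cover at most 200 record_ids per request, taken in key order rather than score order.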
