Index Design

I have a one-to-many data set in which each document, keyed by a unique identifier, can have between 1 and 5 attributes, and each of those attributes has additional attributes of its own. My initial feeling is just to flatten it and run match queries against the 5 columns.

Some data:

id, g1, c1, g2, c2, g3, c3
1234, text file, 0.2, csv file, 0.8, tsv file, 0.1

Here 'g' stands for guess and 'c' stands for confidence. I want to be able to query for rows that have an 80% chance of being a csv file.

{
	"query": {
		"bool": {
			"must": [{
				"range": {
					"c1": {
						"gte": 0.8
					}
				}
			}, {
				"range": {
					"c2": {
						"gte": 0.8
					}
				}
			}, {
				"range": {
					"c3": {
						"gte": 0.8
					}
				}
			}],
			"should": [{
				"match": {
					"g1": "csv file"
				}
			}, {
				"match": {
					"g2": "csv file"
				}
			}, {
				"match": {
					"g3": "csv file"
				}
			}]
		}
	}
}

Is there something I can do better here? Is this too naive?

That seems perfectly reasonable to me :)

If you wanted a bit more structure, you could do:

{
  "id": 1234,
  "g1": {
    "type" : "text",
    "confidence" : 0.8
  },
  "g2": {
    "type" : "csv",
    "confidence" : 0.6
  },
  ...
}

Ultimately it's the same as the flat format you posted (Lucene will flatten all those values to g1.type, g1.confidence, etc.); it just looks a little cleaner.
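For illustration, a query against this layout would address the sub-fields by their dotted paths. A minimal sketch for the first slot only, assuming the field names from the example document above:

```json
{
  "query": {
    "bool": {
      "must": [
        { "match": { "g1.type": "csv" } },
        { "range": { "g1.confidence": { "gte": 0.8 } } }
      ]
    }
  }
}
```

The remaining slots (g2, g3, and so on) would follow the same pattern with their own paths.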

Alternatively you could use nested documents, but that's probably overkill for what you need.
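If you did go the nested route, it could look roughly like the sketch below. The field name "guesses" and its sub-fields are made up for illustration, and the mapping uses the typeless syntax of Elasticsearch 7 and later, so adjust for your version. Each guess becomes one object in an array, and the nested type keeps each object's type and confidence together (a plain array of objects would let the type from one guess match with the confidence from another).

```json
{
  "mappings": {
    "properties": {
      "guesses": {
        "type": "nested",
        "properties": {
          "type": { "type": "text" },
          "confidence": { "type": "float" }
        }
      }
    }
  }
}
```

A single nested query would then cover every guess, however many a document has:

```json
{
  "query": {
    "nested": {
      "path": "guesses",
      "query": {
        "bool": {
          "must": [
            { "match": { "guesses.type": "csv" } },
            { "range": { "guesses.confidence": { "gte": 0.8 } } }
          ]
        }
      }
    }
  }
}
```

The trade-off is extra indexing and query cost for the nested documents, which is why it is probably more than you need with at most 5 fixed slots.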

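Note that your query may need some re-arranging. As written, the "must" clauses require all three confidences to be at least 0.8, and the "should" clauses only influence scoring, so a confidence is never tied to the guess it belongs to. If you want docs that are 80% CSV, you need to tie together the "type" query with the "confidence" query for each slot. Something like: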

{
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "must": [
              {
                "range": {
                  "c1": {
                    "gte": 0.8
                  }
                }
              },
              {
                "match": {
                  "g1": "csv file"
                }
              }
            ]
          }
        },
        {
          "bool": {
            "must": [
              {
                "range": {
                  "c2": {
                    "gte": 0.8
                  }
                }
              },
              {
                "match": {
                  "g2": "csv file"
                }
              }
            ]
          }
        },
        {
          "bool": {
            "must": [
              {
                "range": {
                  "c3": {
                    "gte": 0.8
                  }
                }
              },
              {
                "match": {
                  "g3": "csv file"
                }
              }
            ]
          }
        }
      ]
    }
  }
}

I.e. (g1:csv AND c1 >= 0.8) OR (g2:csv AND c2 >= 0.8) OR (g3:csv AND c3 >= 0.8)
