No Observable Difference Between BBQ and Default Configurations in Elasticsearch – Help with Index Size Comparison

I've been running some tests on Better Binary Quantization (BBQ) in Elasticsearch and comparing it with the default configuration for dense vectors, but I'm not observing the expected differences in disk size or search performance.

Test Setup:

BBQ Index Configuration (my-index):

{
  "mappings": {
    "properties": {
      "vector": {
        "type": "dense_vector",
        "dims": 1024,
        "index_options": {
          "type": "bbq_hnsw",
          "m": 16,
          "ef_construction": 100
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": "1",
      "number_of_replicas": "1"
    }
  }
}

Default Index Configuration (my-index-2):

{
  "mappings": {
    "properties": {
      "vector": {
        "type": "dense_vector",
        "dims": 1024,
        "index_options": {
          "type": "int8_hnsw",
          "m": 16,
          "ef_construction": 100
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": "1",
      "number_of_replicas": "1"
    }
  }
}

Problem:

I embedded 100K comments, each as a 1024-dimensional vector (IMS), and tested both configurations. However, disk usage and search time are almost identical, with no significant improvement from the BBQ configuration. In fact, the default configuration sometimes appears faster in terms of search time.

Index Sizes:

  • BBQ Index (my-index): 1.9GB
  • Default Index (my-index-2): 1.9GB

As shown, there is no difference in disk size between the two configurations.

Questions:

  • Index Size Comparison: How can I accurately measure the size of each index (BBQ vs Default) to check for differences in disk usage?
  • Performance Differences: Has anyone encountered similar results? What settings or tests can I adjust to identify any potential improvements with BBQ?

Hi @mohab_ghobashy , welcome to our community.

Have you read these articles?

Hey @mohab_ghobashy

The disk footprint is dominated by the raw floating-point vectors, which are kept on disk with either configuration.
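To see why, here is a back-of-the-envelope calculation for this thread's setup (100K docs, 1024 dims). The raw float32 copy is identical for both indices; only the quantized copy differs, and that delta is small relative to a 1.9 GB index, so near-identical reported sizes are expected:

```shell
# Approximate per-copy storage for 100,000 vectors of 1024 float32 dims.
# The raw float32 vectors are retained regardless of index_options type.
RAW_BYTES=$((100000 * 1024 * 4))   # raw float32 copy: ~410 MB
INT8_BYTES=$((100000 * 1024))      # int8_hnsw adds ~102 MB (1 byte/dim)
BBQ_BYTES=$((100000 * 1024 / 8))   # bbq_hnsw adds ~13 MB (1 bit/dim)
echo "raw=${RAW_BYTES} int8=${INT8_BYTES} bbq=${BBQ_BYTES}"
```

The difference between the int8 and BBQ copies is roughly 90 MB, which can disappear entirely in the coarse size shown by `_cat/indices` (both round to 1.9GB).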

How did you determine your disk footprint? (which API, or looking directly at the directory, etc.)
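For reference, two ways to inspect this (a sketch; host, port, and index names are taken from this thread and may need adjusting):

```shell
# 1. Coarse store size per index
curl -s "localhost:9200/_cat/indices/my-index,my-index-2?v&h=index,pri.store.size"

# 2. Per-field disk-usage breakdown, which separates the raw vectors from the
#    quantized/HNSW structures (technical-preview API)
curl -s -X POST "localhost:9200/my-index/_disk_usage?run_expensive_tasks=true&pretty"
```

The per-field breakdown is the more useful of the two here, since it shows how much of the total is the raw vector data that both configurations share.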

For performance differences, it's useful to know the queries used (the entire search request), the ES version, and the hardware on which it's tested.

Also, how are you measuring search time? Is this the 'took' time in the request or measured client side?
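The distinction matters: the `took` field in the response is the server-side search time, while client-side timing also pays for network transfer and serialization of the hits. A sketch of reading `took` directly (the query vector is elided; it must be a real 1024-dim array matching the mapping):

```shell
# Placeholder; substitute a JSON array of 1024 floats.
QUERY_VECTOR='[0.1, 0.2]'
curl -s -X POST "localhost:9200/my-index/_search" \
  -H 'Content-Type: application/json' \
  -d "{\"knn\": {\"field\": \"vector\", \"query_vector\": ${QUERY_VECTOR}, \"k\": 10, \"num_candidates\": 100}}" \
  | jq '.took'
```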

Hey @BenTrent,

Thanks for your thoughts!

I’ve been using the GET /_cat/indices/full-precision-index?v command to track disk usage, and I also rely on the GET /_stats/store API to get a closer look at the storage details.

I've been checking the took time on the client side, as you mentioned.

my Docker setup is showing the following stats for the elasticsearch container:

  • CPU Usage: 1.95%
  • Memory Usage: 5.017GiB / 15.25GiB (32.89%)
  • Block I/O: 3.35GB / 26.1GB

Why are the search results showing full-precision floating-point values for the vectors, even though the BBQ index should be using binary quantization?

@mohab_ghobashy we keep the raw floating point values around; _source is what you provide to ES.

Having the raw values is important for:

  • reindexing
  • rescoring via the raw values, if desired
  • re-quantizing and segment merging

Usually, there is no good reason to actually return the raw vector client side.

I would augment your search to only specifically include returning the text field.

query = {
  "knn": {...},
  "_source": {"includes": ["my_field"]}
}

This should give you a performance boost, as serializing many floating-point values is very expensive.
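An equivalent sketch using `excludes` instead of `includes`, so everything except the vector field comes back (index and field names are from this thread; the query vector is elided and must be a full 1024-dim array):

```shell
QUERY_VECTOR='[0.1, 0.2]'   # placeholder; use a real 1024-dim vector
curl -s -X POST "localhost:9200/my-index/_search" \
  -H 'Content-Type: application/json' \
  -d "{
    \"knn\": {\"field\": \"vector\", \"query_vector\": ${QUERY_VECTOR}, \"k\": 10, \"num_candidates\": 100},
    \"_source\": {\"excludes\": [\"vector\"]}
  }"
```

`excludes` is convenient when the documents have many fields you do want returned and only the vector should be dropped.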