Hi there! I've been playing a bit with the dense_vector field recently. I have a collection of thousands of vectors, 100 dimensions each. I created the index with the following config:
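(The actual config isn't shown in the post; for context, a minimal dense_vector mapping for 100-dimensional vectors might look like the sketch below. The index name, field name, and everything else here are placeholder assumptions, not the original setup.)

```python
# Illustrative only: a plausible dense_vector index config for
# 100-dimensional vectors. "my_vectors" and "embedding" are placeholders.
index_config = {
    "mappings": {
        "properties": {
            "embedding": {
                "type": "dense_vector",  # the field type discussed in this thread
                "dims": 100,             # matches the 100-dimensional vectors
            }
        }
    }
}

# With the Python client this would be passed to indices.create, roughly:
# es.indices.create(index="my_vectors", mappings=index_config["mappings"])
```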
I use bulk upload, with a batch size of 64 vectors at a time, but I also tried uploading the vectors one by one. The problem that I face is that somewhere around batches 4040, 8080, etc. there is a massive slowdown and the query takes more than a minute to finish, but for the rest of the calls, the standard 10s timeout of the Python client is enough.
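For reference, a minimal sketch of the batching described above. It's pure Python with no Elasticsearch connection; the action shape mirrors what `elasticsearch.helpers.bulk` expects, and the index name, field name, and ID scheme are assumptions, not taken from the original code:

```python
# Chunk vectors into batches of 64, as described in the post.
def batched_actions(vectors, index="my_vectors", batch_size=64):
    """Yield lists of bulk-index actions, batch_size vectors per list."""
    batch = []
    for i, vec in enumerate(vectors):
        batch.append({"_index": index, "_id": i, "embedding": vec})
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch

# Example: 10,000 vectors of 100 dims each (zeros as stand-ins).
vectors = [[0.0] * 100 for _ in range(10_000)]
batches = list(batched_actions(vectors))
print(len(batches))      # 157 batches: 156 full + 1 partial
print(len(batches[-1]))  # 16 vectors left over (10_000 - 156 * 64)
```

Each batch would then go to the client's bulk helper; the slowdown described above would show up as one of these calls taking far longer than its neighbors.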
My question is: What may be the root cause of that issue? Is that a garbage collector process, or maybe the index being rebuilt? Or maybe I'm hitting some kind of segment size limit?
The way we index vectors is that we don't build a graph on the fly; we just buffer vectors.
But once enough vectors are buffered to create a segment (or a refresh is triggered), we create a segment, and that's where the main work of building the graph starts, which may take time. So the indexing itself is very fast, but creating a segment or refreshing takes time.
So we would recommend creating segments less often. By default, if there are no searches, a shard switches to the "search_idle" state and no refreshes happen, so segments are created only when the memory buffer is full (or the limit on the translog is reached).
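One common way to create segments less often during a bulk load is to disable periodic refreshes while indexing and refresh once at the end. A minimal sketch, assuming the index name "my_vectors" (the settings values are real Elasticsearch settings; the client calls are shown as comments since they need a live cluster):

```python
# Disable periodic refreshes during ingest; "-1" turns them off entirely.
ingest_settings = {"index": {"refresh_interval": "-1"}}
# Restore a normal interval afterwards (1s is the usual default).
restore_settings = {"index": {"refresh_interval": "1s"}}

# With the Python client, roughly:
# es.indices.put_settings(index="my_vectors", settings=ingest_settings)
# ... bulk index all vectors ...
# es.indices.refresh(index="my_vectors")      # one refresh, one graph build
# es.indices.put_settings(index="my_vectors", settings=restore_settings)
```

This concentrates the expensive graph-building work into fewer, deliberate refreshes instead of having it interrupt bulk requests at unpredictable points.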
I am wondering whether this current behaviour presents an issue for you, or whether setting a large enough timeout in your client would be sufficient.
query takes more than a minute to finish, but for the rest of the calls, the standard 10s timeout of the Python client is enough
By "query" here do you mean a search query or indexing request?
Thanks for the reply. By "query" I meant the indexing request, but everything is clear now.
That behaviour is not an issue, just quite surprising while I was monitoring the latency; I only wanted to confirm that it's normal.