When is the analysis of a string field executed? Is it executed during the Lucene commit, or before the documents are stored in the in-memory buffer of the Lucene index?
Does the inverted index also contain numeric fields, or only words?
Do you mean when the Elasticsearch index refreshes and the segments are created in the filesystem cache? What precisely do you mean by "flush of the buffer"? When the buffer is drained?
Here is an overview of the indexing flow of a document:
the Elasticsearch Index API receives the JSON document
the JSON document is passed to the correct node for the correct shard
the document is appended to the transaction log; the transaction log is separate from Lucene and records the Elasticsearch Index API operations, for durability and recovery
after being written to the transaction log, the document is passed to the Lucene API
the Lucene IndexWriter queues documents in an internal indexing buffer
if the indexing buffer is full, or the Lucene IndexWriter flush() method is executed, Lucene creates token streams for all the indexable fields in the document, as specified in the Elasticsearch mappings
the token streams are processed by the analyzer specified in the Elasticsearch mappings
the result is written to a new segment in RAM
the new segment is persisted to disk when the Lucene IndexWriter commit() method is executed
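The buffer/flush/commit part of the flow above can be sketched as a toy model. This is plain Python, not the real Lucene API; the class, method names, and dict-based "segments" are purely illustrative (real segments are binary, write-once files):

```python
# Toy model of Lucene's buffer -> RAM segment -> commit flow.
# Illustrative only; real Lucene is far more involved.

class ToyIndexWriter:
    def __init__(self, buffer_limit=3):
        self.buffer = []            # in-memory indexing buffer
        self.ram_segments = []      # segments flushed to RAM, not yet durable
        self.disk_segments = []     # segments persisted by commit()
        self.buffer_limit = buffer_limit

    def add_document(self, doc):
        # documents are only queued here; analysis happens at flush time,
        # when the buffered documents are turned into a segment
        self.buffer.append(doc)
        if len(self.buffer) >= self.buffer_limit:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # tokenize each field value (stand-in for the analyzer chain)
        segment = [{f: v.lower().split() for f, v in d.items()}
                   for d in self.buffer]
        self.ram_segments.append(segment)
        self.buffer = []

    def commit(self):
        # make all RAM segments durable
        # (stand-in for writing segment files + fsync)
        self.flush()
        self.disk_segments.extend(self.ram_segments)
        self.ram_segments = []

writer = ToyIndexWriter()
writer.add_document({"title": "Hello World"})
writer.add_document({"title": "Second Doc"})
writer.flush()    # analysis + new segment in RAM
writer.commit()   # segment persisted to "disk"
```

Note how flush() and commit() are distinct steps: a flush produces a searchable in-RAM segment, while only a commit makes it durable.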
Elasticsearch does all the work for you. There is no need to be concerned about Lucene commit() or flush() or internal buffers. In the Elasticsearch API, you find:
flush (Flush API | Elasticsearch Guide [8.11] | Elastic) works on the Elasticsearch index model, i.e. it maps the operation to all relevant shards, clears the transaction log, and flushes the Lucene indices.
synced flush (Synced flush API | Elasticsearch Guide | Elastic) is like flush, but aimed at rarely used indices: the operation can release unused buffer resources. "Synced" means Elasticsearch can recover an index from the last sync point in time. (Synced flush was deprecated in 7.6 and removed in 8.0.)
refresh (Refresh API | Elasticsearch Guide [8.11] | Elastic) is like flush, but additionally loads the most current segments for search operations and discards unused segments. This makes Elasticsearch index changes visible to search. By default, Elasticsearch executes a refresh every second.
The Lucene in-memory buffer is the same buffer as the indexing buffer.
Lucene's IndexWriter flush() is executed when the indexing buffer (in-memory buffer) is full, or when it is triggered by an Elasticsearch refresh.
After the operations in that method complete, the results are written to a new segment in RAM.
IndexWriter commit() is triggered by an Elasticsearch flush event, and the segments are persisted to disk.
So the Elasticsearch flush is not the same operation as the Lucene IndexWriter flush().
You helped me a lot in understanding the processes behind Elasticsearch =)
There are two in-memory buffers. The Elasticsearch indexing buffer can hold JSON documents and is aware of Elasticsearch features, like the transaction log. Lucene keeps an internal buffer for Lucene documents. This buffer works "below" the Elasticsearch API, on each shard.
Yes, that is my knowledge.
Yes, the buffered results are written out as a new segment (Lucene segments are write-once; existing segments are never appended to).
Exactly. And the transaction log is cleared. See also
The action of performing a commit and truncating the translog is known in Elasticsearch as a flush.
Also, commit() is a bit different. One of my difficulties was understanding why there is no commit() API method in Elasticsearch that would trigger a Lucene commit() and nothing more. There are Elasticsearch commit points, but they live in the transaction log.
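The quoted relationship, "flush = Lucene commit + translog truncation", can be sketched in a few lines of toy Python (class and method names are illustrative, not the real Elasticsearch code):

```python
# Toy model of an Elasticsearch shard's flush, illustrating that an
# Elasticsearch flush is a Lucene commit plus translog truncation.

class ToyShard:
    def __init__(self):
        self.translog = []        # per-shard transaction log (durable ops)
        self.committed_ops = 0    # ops made durable in Lucene via commit()

    def index(self, op):
        # every operation is recorded in the translog first, so it can be
        # replayed on recovery even before any Lucene commit has happened
        self.translog.append(op)

    def lucene_commit(self):
        self.committed_ops += len(self.translog)

    def es_flush(self):
        # once the ops are durable in Lucene, the translog entries are
        # no longer needed for recovery and can be truncated
        self.lucene_commit()
        self.translog = []

shard = ToyShard()
shard.index("index doc 1")
shard.index("index doc 2")
shard.es_flush()
```

After es_flush(), the operations are durable in Lucene and the translog is empty, which is exactly why Elasticsearch exposes flush rather than a bare commit().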
So, you can say that if the Index API receives new JSON documents, they are first cached in the Elasticsearch indexing buffer, then passed to the correct node for the correct shard, and after that written to the transaction log? Will the Elasticsearch indexing buffer be cleared after an Elasticsearch flush?
Would that mean the indexing buffer is the same as the transaction log? But the transaction log is frequently persisted to disk, while the indexing buffer lives only in RAM?
Sorry for the stupid questions, but I'm close to understanding the whole indexing process =)
the receiving node consults the cluster state to find the shard ID
the JSON document is sent to the node that holds the primary shard; the receiving node forgets about the document (unless it holds the primary shard itself)
the node of the primary shard has an in-memory indexing buffer; the JSON document is queued in memory on that node
the per-shard transaction log is written to disk and fsynced; after that, the node acknowledges the document as received. (The document is also sent to the nodes that hold replicas of the shard.)
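The shard lookup in the first step is a deterministic routing formula, which is why any receiving node can forward the document correctly. Elasticsearch actually uses a Murmur3 hash of the routing value (the document `_id` by default); the sketch below substitutes a stand-in hash just to show the idea:

```python
import hashlib

def stand_in_hash(value: str) -> int:
    # Elasticsearch really uses Murmur3; md5 is only a stand-in here
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def shard_for(doc_id: str, num_primary_shards: int) -> int:
    # conceptually: shard = hash(_routing) % number_of_primary_shards
    return stand_in_hash(doc_id) % num_primary_shards

# every node computes the same shard number from the same inputs,
# so the receiving node knows which node holds the primary shard
shard = shard_for("my-doc-1", 5)
```

Because the formula depends on the number of primary shards, that number cannot change after index creation without reindexing.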