When is the Analyzing Process executed?

Hello there,

I have two questions.

When is the analysis of a string field executed? Is it executed during the Lucene commit, or before the documents are stored in the in-memory buffer of the Lucene index?

Does the inverted Index also contain numeric fields or only words?

Thanks=)

Hey,

It is executed before the documents are stored, because the analyzed strings need to be stored in the inverted index.

Further infos:

--Alex

Analyzing strings takes place when new segments are created. Segments are created during a flush of the indexing buffer (see "Indexing buffer settings" in the Elasticsearch Guide [8.11]).

Lucene commit operation comes later, it writes segments to disk.

The Lucene index format includes k-d trees for numeric types (see PointValues in the Lucene 6.3.0 API), but they are not stored in the inverted index.
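To make the split between words and numbers concrete, here is a minimal sketch in Python. It is a toy model, not Lucene code: the field names `title` and `price` are made up, and a sorted list searched with `bisect` stands in for the k-d tree (it only covers the one-dimensional case).

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# Hypothetical documents, invented for illustration.
docs = {
    0: {"title": "quick brown fox", "price": 12},
    1: {"title": "lazy brown dog", "price": 30},
    2: {"title": "quick red fox", "price": 7},
}

# Inverted index for the text field: term -> sorted list of doc ids.
inverted = defaultdict(list)
for doc_id, doc in sorted(docs.items()):
    for term in set(doc["title"].lower().split()):
        inverted[term].append(doc_id)

# Separate structure for the numeric field: (value, doc id) pairs sorted by value.
# Assumption: a sorted list is only a one-dimensional stand-in for a k-d tree.
points = sorted((doc["price"], doc_id) for doc_id, doc in docs.items())
values = [value for value, _ in points]


def term_query(term):
    """Look up a word in the inverted index."""
    return inverted.get(term, [])


def range_query(low, high):
    """Find doc ids whose price lies in [low, high] using the point structure."""
    start = bisect_left(values, low)
    end = bisect_right(values, high)
    return sorted(doc_id for _, doc_id in points[start:end])
```

Here `term_query("brown")` only touches the inverted index, while `range_query(5, 15)` only touches the numeric structure; the prices never appear as terms.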

Thanks for your reply =)

Do you mean when the Elasticsearch index refreshes and the segments are created in the filesystem cache? What precisely do you mean by a flush of the buffer? Is that when the buffer is drained?

Thank you! I couldn't find it anywhere. =)

Here is an overview of the indexing flow of a document:

  • Elasticsearch Index API receives JSON document
  • JSON document is passed to the correct node for the correct shard
  • the documents are queued in the transaction log. The transaction log is separate from Lucene and records the Elasticsearch Index API operations, for durability and recovery
  • after being written to the transaction log, the document is passed to the Lucene API
  • Lucene IndexWriter queues documents in an internal indexing buffer
  • if the indexing buffer is full, or the Lucene IndexWriter flush() method is executed, Lucene creates token streams for all the indexable fields in the document as specified in the Elasticsearch mappings
  • the token streams are processed by the analyzer as specified in the Elasticsearch mappings
  • the result is written to a new segment in RAM
  • the new segment is persisted to disk when Lucene IndexWriter commit() method is executed
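The last few steps of the list can be sketched as a toy model in Python. This is not Lucene's API: `ToyIndexWriter` and `max_buffered_docs` are hypothetical names, and the sketch leaves out analysis entirely. It only tracks where a document sits at each stage (buffer, segment in RAM, committed segment), following the description above.

```python
class ToyIndexWriter:
    """Toy stand-in for the buffer -> flush -> commit steps; not Lucene's API."""

    def __init__(self, max_buffered_docs=2):
        self.max_buffered_docs = max_buffered_docs  # hypothetical buffer limit
        self.buffer = []              # documents waiting in the indexing buffer
        self.ram_segments = []        # segments created by flush, not yet committed
        self.committed_segments = []  # segments persisted by commit

    def add_document(self, doc):
        self.buffer.append(doc)
        if len(self.buffer) >= self.max_buffered_docs:
            self.flush()  # a full buffer triggers a flush

    def flush(self):
        # Turn the buffered documents into one new segment held in RAM.
        if self.buffer:
            self.ram_segments.append(tuple(self.buffer))
            self.buffer = []

    def commit(self):
        # Flush whatever is still buffered, then persist all RAM segments.
        self.flush()
        self.committed_segments.extend(self.ram_segments)
        self.ram_segments = []


writer = ToyIndexWriter(max_buffered_docs=2)
writer.add_document("doc-1")
state_after_one = (len(writer.buffer), len(writer.ram_segments))
writer.add_document("doc-2")  # buffer is full, so a flush happens here
state_after_two = (len(writer.buffer), len(writer.ram_segments))
writer.add_document("doc-3")
writer.commit()
```

After the second document the buffer is empty and one segment exists in RAM; only `commit()` moves segments into `committed_segments`.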

Elasticsearch does all the work for you. There is no need to be concerned about Lucene commit() or flush() or internal buffers. In the Elasticsearch API, you find

Thanks for your article. =)

Correct me if my speculation is wrong.

  • The Lucene in Memory Buffer is the same Buffer as the Indexing Buffer
  • The Lucene IndexWriter flush() will be executed, if the Indexing buffer (In Memory Buffer) is full, or the IndexWriter flush() method is triggered by an Elasticsearch refresh.
  • After the operations in the method, the results are written to a new segment in RAM.
  • The IndexWriter commit() is triggered by the flush event of Elasticsearch, and the segments are persisted to disk

So the Elasticsearch flush is not the same as the Lucene IndexWriter flush().

You helped me a lot about understanding the processes behind Elasticsearch =)

There are two in-memory buffers. The Elasticsearch indexing buffer can hold JSON documents and is aware of Elasticsearch features, like the transaction log. Lucene keeps an internal buffer for Lucene documents. This buffer works "below" Elasticsearch API, on each shard.

Yes, that is my knowledge.

Yes, the results are appended to the current segment, if possible, or a new segment is created.

Exactly. And the transaction log is cleared. See also this sentence from "Making Changes Persistent" in Elasticsearch: The Definitive Guide [2.x]:

"The action of performing a commit and truncating the translog is known in Elasticsearch as a flush."
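As a rough illustration of "commit plus truncating the translog", here is a toy shard in Python. All names (`ToyShard`, `crash_and_recover`) are hypothetical and this is not how Elasticsearch is implemented; it only shows why the translog can be emptied once a commit has happened, and why it is needed before that.

```python
class ToyShard:
    """Toy model of one shard: translog plus committed/uncommitted documents."""

    def __init__(self):
        self.translog = []     # operations recorded since the last flush
        self.uncommitted = []  # documents that exist only in memory
        self.committed = []    # documents persisted by a commit

    def index(self, doc):
        # Every operation is recorded in the translog and kept in memory.
        self.translog.append(doc)
        self.uncommitted.append(doc)

    def flush(self):
        # Elasticsearch-style flush in this toy: commit, then truncate the translog.
        self.committed.extend(self.uncommitted)
        self.uncommitted = []
        self.translog = []

    def crash_and_recover(self):
        # In-memory state is lost; the translog is replayed on top of the commit.
        self.uncommitted = list(self.translog)

    def all_docs(self):
        return self.committed + self.uncommitted


shard = ToyShard()
shard.index("a")
shard.flush()
translog_after_flush = list(shard.translog)
shard.index("b")
shard.crash_and_recover()
```

Document "a" survives because it was committed, and "b" survives because it is still in the translog and gets replayed.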

Yes. An Elasticsearch flush() is more complex.

For example, Lucene IndexWriter flush() will be executed when executing an Elasticsearch forceMerge() (see "Force merge API" in the Elasticsearch Guide [8.11]). So the wording is a little bit confusing.

Also commit() is a bit different. It was one of the difficulties for me to understand why there is no such thing as a commit() API method in Elasticsearch which would trigger Lucene commit() and nothing more. There are Elasticsearch commit points, but they are in the transaction log.

There is also some light interaction between Elasticsearch and Lucene. Lucene can write end-to-end CRC-32 checksums in segments (see LUCENE-5925, "Use rename instead of segments_N fallback / segments.gen etc", in the ASF JIRA). These are used in Elasticsearch to help validate an index while recovering (see "Add extra validation to segments_N files", issue #8403 in elastic/elasticsearch on GitHub).
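The general idea of a checksum that travels with the data can be sketched in a few lines of Python. The layout here (payload followed by a 4-byte big-endian CRC-32) is an assumption made for illustration and is not Lucene's actual file format; `write_with_checksum` and `verify` are hypothetical helpers.

```python
import struct
import zlib


def write_with_checksum(payload: bytes) -> bytes:
    """Append a CRC-32 of the payload as a 4-byte big-endian footer."""
    checksum = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + struct.pack(">I", checksum)


def verify(blob: bytes) -> bool:
    """Recompute the CRC-32 over the payload and compare it with the footer."""
    if len(blob) < 4:
        return False
    payload, footer = blob[:-4], blob[-4:]
    (expected,) = struct.unpack(">I", footer)
    return (zlib.crc32(payload) & 0xFFFFFFFF) == expected


blob = write_with_checksum(b"segment data")
corrupted = b"X" + blob[1:]  # change the first byte, keep the old footer
```

A reader that calls `verify` before trusting the bytes will accept `blob` and reject `corrupted`, which is the kind of check that helps during recovery.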

Thank you =)

So can you say that when the Index API receives new JSON documents, they are first cached in the Elasticsearch indexing buffer, then passed to the correct node for the correct shard, and after that written to the transaction log? Will the Elasticsearch indexing buffer be cleared after an Elasticsearch flush?

So does that mean the indexing buffer is the same as the transaction log, except that the transaction log is frequently persisted to disk while the indexing buffer lives only in RAM?

Sorry for the stupid questions, but I'm close to understanding the whole indexing process =)

The sequence is

  1. new JSON document is received
  2. the receiving node consults the cluster state to find the shard ID
  3. JSON document is sent to the node that holds the primary shard. The receiving node forgets about the document (unless it holds the primary shard)
  4. the node of the primary shard has an in-memory indexing buffer. The JSON document is queued in-memory per node.
  5. the per-shard transaction log is written to disk and fsynced; after that, the node reports successful receipt of the document. (The document is also sent to nodes that hold replicas of the shard ID.)
  6. now the document is passed to Lucene
  7. etc.
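Steps 2 and 3 of the sequence can be sketched like this. It is a simplified model: the Elasticsearch documentation describes routing as a hash of the routing value (the document ID by default) reduced to a shard number, but `zlib.crc32` is only a stand-in hash here, and the `cluster_state` dictionary with its node names is invented for illustration.

```python
import zlib

# Hypothetical cluster state: shard number -> node holding the primary shard.
cluster_state = {0: "node-a", 1: "node-b", 2: "node-c"}


def pick_shard(routing: str, num_primary_shards: int) -> int:
    """Map a routing value to a shard number.

    Assumption: zlib.crc32 is a stand-in for the real hash function.
    """
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


def primary_node_for(doc_id: str) -> str:
    """Find the node that should receive the document (steps 2 and 3)."""
    shard = pick_shard(doc_id, len(cluster_state))
    return cluster_state[shard]


target = primary_node_for("my-doc-1")
```

The point is only that the mapping is deterministic: any node that receives the document computes the same shard number and forwards it to the same primary.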

Yes.

Thank you =)

Now, the process is clear. :wink:
