When is the Analyzing Process executed?

TWalter · December 29, 2016, 7:04am

Hello there,

I have two questions.

When is the analyzing of a string field executed? Is it executed during the Lucene commit, or before the documents are stored in the in-memory buffer of the Lucene index?

Does the inverted Index also contain numeric fields or only words?

Thanks=)

spinscale · December 29, 2016, 9:05am

Hey,

It is executed before the documents are stored, because the analyzed strings need to be stored in the inverted index.

Further infos:

--Alex

jprante · December 29, 2016, 9:13am

Analyzing strings takes place when new segments are created. Segments are created during a flush of the indexing buffer (see "Indexing buffer settings" in the Elasticsearch Guide [8.11]).

The Lucene commit operation comes later; it writes segments to disk.

The Lucene index format includes k-d trees for numeric types (see PointValues in the Lucene 6.3.0 API), but they are not stored in the inverted index.
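
To illustrate the split between analyzed text and numeric values, here is a minimal Python sketch. It is a toy model only: the `analyze` function, the `ToyIndex` class, and the sorted list standing in for a k-d tree are all assumptions made for illustration, not Lucene code or Lucene data structures.

```python
import bisect
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase, then split on non-alphanumeric characters."""
    return [token for token in re.split(r"[^a-z0-9]+", text.lower()) if token]


class ToyIndex:
    """Hypothetical index with two separate structures per document."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> doc ids (the inverted index)
        self.points = []                  # sorted (value, doc_id) pairs, stand-in for a k-d tree

    def add(self, doc_id, text, number):
        # Analyzed terms go into the inverted index.
        for term in analyze(text):
            self.postings[term].add(doc_id)
        # The numeric value goes into a separate sorted structure.
        bisect.insort(self.points, (number, doc_id))

    def term_query(self, term):
        return sorted(self.postings.get(term, set()))

    def range_query(self, low, high):
        start = bisect.bisect_left(self.points, (low, float("-inf")))
        end = bisect.bisect_right(self.points, (high, float("inf")))
        return sorted(doc_id for _, doc_id in self.points[start:end])


index = ToyIndex()
index.add(1, "Quick brown Fox", 10)
index.add(2, "Lazy fox", 25)
print(index.term_query("fox"))     # documents containing the term "fox"
print(index.range_query(20, 30))   # documents whose number lies in [20, 30]
```

The term lookup only touches `postings`, and the range lookup only touches `points`, which is the point of keeping numeric values out of the inverted index.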

TWalter · December 29, 2016, 9:38am

Thanks for your reply =)

Do you mean when the Elasticsearch index refreshes and the segments are created in the filesystem cache? What precisely do you mean by a flush of the buffer? When you drain the buffer?

Thank you! I couldn't find it anywhere. =)

jprante · December 29, 2016, 2:39pm

Here is an overview of the indexing flow of a document:

1. The Elasticsearch Index API receives a JSON document.
2. The JSON document is passed to the correct node for the correct shard.
3. The document is queued in the transaction log. The transaction log is separate from Lucene and records the Elasticsearch Index API operations, for durability and recovery.
4. After it is written to the transaction log, the document is passed to the Lucene API.
5. The Lucene IndexWriter queues documents in an internal indexing buffer.
6. If the indexing buffer is full, or the Lucene IndexWriter flush() method is executed, Lucene creates token streams for all the indexable fields in the document, as specified in the Elasticsearch mappings.
7. The token streams are processed by the analyzer specified in the Elasticsearch mappings.
8. The result is written to a new segment in RAM.
9. The new segment is persisted to disk when the Lucene IndexWriter commit() method is executed.
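
The buffer, flush, and commit steps listed above can be sketched as a small Python model. This is a teaching sketch under stated assumptions: `ToyIndexWriter`, `max_buffered_docs`, and the whitespace tokenizer are hypothetical names invented here, the "disk" is just a Python list, and the ordering follows the list above rather than Lucene's actual internals.

```python
class ToyIndexWriter:
    """Simplified model of buffer -> flush -> commit. Not the Lucene API."""

    def __init__(self, max_buffered_docs=2):
        self.max_buffered_docs = max_buffered_docs
        self.buffer = []         # (doc_id, text) pairs waiting in the indexing buffer
        self.ram_segments = []   # segments that exist only in memory
        self.disk_segments = []  # segments "persisted" by commit()

    def add_document(self, doc):
        self.buffer.append(doc)
        # A full buffer triggers a flush, as in the flow described above.
        if len(self.buffer) >= self.max_buffered_docs:
            self.flush()

    def flush(self):
        """Tokenize the buffered documents and write a new in-memory segment."""
        if not self.buffer:
            return
        segment = {}
        for doc_id, text in self.buffer:
            for token in text.lower().split():
                segment.setdefault(token, []).append(doc_id)
        self.ram_segments.append(segment)
        self.buffer = []

    def commit(self):
        """Flush whatever is still buffered, then persist all in-memory segments."""
        self.flush()
        self.disk_segments.extend(self.ram_segments)
        self.ram_segments = []


writer = ToyIndexWriter(max_buffered_docs=2)
writer.add_document((1, "Hello World"))
writer.add_document((2, "hello again"))  # buffer is full, so a flush happens here
writer.commit()                          # the segment moves from RAM to "disk"
```

The design choice worth noticing is that flush() and commit() are separate: a flush produces a segment without making it durable, and only commit() moves it to the persisted list.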

Elasticsearch does all the work for you. There is no need to be concerned about Lucene commit() or flush() or internal buffers. In the Elasticsearch API, you will find:

flush (see "Flush API" in the Elasticsearch Guide [8.11]) works on the Elasticsearch index model, i.e. it maps the operation to all relevant shards, clears the transaction log, and flushes the Lucene indices.
sync flush (see "Synced flush API" in the Elasticsearch Guide [8.11]) is like flush, but aimed at better resource usage on rarely used indices: the operation can manage unused buffer resources. 'Sync' means that Elasticsearch can recover an index from the last sync point in time.
refresh (see "Refresh API" in the Elasticsearch Guide [8.11]) is like flush, but includes loading the most current segments for search operations and discarding unused segments. This makes Elasticsearch index changes visible. By default, Elasticsearch executes a refresh every second.
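
As a rough illustration of how the flush and refresh operations are called, the following Python sketch builds the two REST requests without sending them. The base URL `http://localhost:9200` and the index name `my-index` are assumptions for a local test setup, and `build_request` is a hypothetical helper, not part of any Elasticsearch client.

```python
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: a local single-node cluster


def build_request(index, operation):
    """Build (but do not send) a POST request for an index-level operation."""
    if operation not in ("_flush", "_refresh"):
        raise ValueError(f"unsupported operation: {operation}")
    return urllib.request.Request(f"{BASE_URL}/{index}/{operation}", method="POST")


flush_request = build_request("my-index", "_flush")
refresh_request = build_request("my-index", "_refresh")

print(flush_request.get_method(), flush_request.full_url)
print(refresh_request.get_method(), refresh_request.full_url)

# To actually send one of them against a running cluster:
# urllib.request.urlopen(flush_request)
```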

TWalter · December 29, 2016, 5:52pm

Thanks for your article. =)

Correct me if my speculation is wrong.

1. The Lucene in-memory buffer is the same buffer as the indexing buffer.
2. The Lucene IndexWriter flush() is executed if the indexing buffer (in-memory buffer) is full, or if the IndexWriter flush() method is triggered by an Elasticsearch refresh.
3. After the operations in the method, the results are written to a new segment in RAM.
4. The IndexWriter commit() is triggered by the flush event of Elasticsearch, and the segments are persisted to disk.

So, the flush of Elasticsearch is not the same method as the Lucene IndexWriter flush().

You helped me a lot about understanding the processes behind Elasticsearch =)

jprante · December 29, 2016, 7:39pm

There are two in-memory buffers. The Elasticsearch indexing buffer can hold JSON documents and is aware of Elasticsearch features, like the transaction log. Lucene keeps an internal buffer for Lucene documents. This buffer works "below" the Elasticsearch API, on each shard.

Yes, that is my knowledge.

Yes, the results are appended to the current segment, if possible, or a new segment is created.

Exactly. And the transaction log is cleared. See also this quote from "Making Changes Persistent" in Elasticsearch: The Definitive Guide [2.x]:

"The action of performing a commit and truncating the translog is known in Elasticsearch as a flush."

Yes. An Elasticsearch flush() is more complex.

For example, Lucene IndexWriter flush() will be executed when executing an Elasticsearch forceMerge() (see "Force merge API" in the Elasticsearch Guide [8.11]). So the wording is a little bit confusing.

Also commit() is a bit different. It was one of the difficulties for me to understand why there is no such thing as a commit() API method in Elasticsearch which would trigger Lucene commit() and nothing more. There are Elasticsearch commit points, but they are in the transaction log.

There is also some light interaction between Elasticsearch and Lucene. Lucene can write end-to-end CRC-32 checksums in segments (see LUCENE-5925, "Use rename instead of segments_N fallback / segments.gen etc", in the ASF JIRA). These are used in Elasticsearch to help validate an index while recovering (see "Add extra validation to segments_N files", Issue #8403 in elastic/elasticsearch on GitHub).
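
The general idea of a checksum footer can be sketched in a few lines of Python. This only illustrates the technique of appending and verifying a CRC-32. The byte layout, `write_with_checksum`, and `verify_checksum` are assumptions made for this example and do not reproduce the Lucene file format.

```python
import struct
import zlib


def write_with_checksum(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC-32 of the payload as a footer."""
    return payload + struct.pack(">I", zlib.crc32(payload))


def verify_checksum(data: bytes) -> bool:
    """Recompute the CRC-32 over the payload and compare it with the footer."""
    if len(data) < 4:
        return False
    payload, footer = data[:-4], data[-4:]
    return struct.unpack(">I", footer)[0] == zlib.crc32(payload)


stored = write_with_checksum(b"segment bytes")
corrupted = b"X" + stored[1:]  # flip the first byte to simulate corruption

print(verify_checksum(stored))     # intact data passes the check
print(verify_checksum(corrupted))  # corrupted data fails the check
```

During recovery, a check like this lets the reader reject a damaged file before trusting its contents.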

TWalter · December 30, 2016, 2:51am

Thank you =)

So you can say that if the Index API receives new JSON documents, they are first cached in the Elasticsearch indexing buffer, then passed to the correct node for the correct shard, and after that written to the transaction log? Will the Elasticsearch indexing buffer be cleared after an Elasticsearch flush?

So that would mean that the indexing buffer is the same as the transaction log. But the transaction log is frequently persisted to disk, and the indexing buffer is only in RAM?

Sorry for the stupid questions, but I'm close to understanding the whole indexing process =)

jprante · December 30, 2016, 9:36am

The sequence is

1. A new JSON document is received.
2. The receiving node consults the cluster state to find the shard ID.
3. The JSON document is sent to the node that holds the primary shard. The receiving node forgets about the document (unless it holds the primary shard).
4. The node of the primary shard has an in-memory indexing buffer. The JSON document is queued in memory per node.
5. The per-shard transaction log is written to disk and fsynced; after that, the node reports successful receipt of the document. (The document is also sent to nodes that hold replicas of the shard ID.)
6. Now the document is passed to Lucene.
etc.
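
Two steps of the sequence above, finding the shard and fsyncing the transaction log before acknowledging, can be sketched in Python. Treat this as a simplified illustration: Elasticsearch routes by a hash of the routing value (the document ID by default) modulo the number of primary shards, but CRC-32 is swapped in here for the real hash function, and `route_to_shard`, `append_to_translog`, and the JSON-lines log format are hypothetical.

```python
import json
import os
import tempfile
import zlib


def route_to_shard(routing_key: str, num_primary_shards: int) -> int:
    """Pick a shard from a hash of the routing key (CRC-32 as a stand-in hash)."""
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


def append_to_translog(path: str, operation: dict) -> None:
    """Append one operation as a JSON line and fsync before returning."""
    with open(path, "a", encoding="utf-8") as log:
        log.write(json.dumps(operation) + "\n")
        log.flush()             # push Python's buffer to the operating system
        os.fsync(log.fileno())  # ask the operating system to persist it to disk


shard = route_to_shard("doc-1", 5)
path = os.path.join(tempfile.mkdtemp(), f"translog-{shard}.log")
append_to_translog(path, {"op": "index", "id": "doc-1", "source": {"title": "hello"}})
# Only after append_to_translog returns would the node acknowledge the request.
print(shard, path)
```

The same routing key always maps to the same shard, which is why the number of primary shards matters for where a document ends up.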

Yes.

TWalter · January 1, 2017, 3:08pm

Thank you =)

Now, the process is clear.

system · January 29, 2017, 3:08pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
