New index_options vs term_vector?


(Tanguy) #1

Hi,

Can someone be kind enough to explain the difference between term_vector
and the new index_options (0.20) parameter? I'm not sure I understand what
docs, freqs and positions mean.

Thanks,

-- Tanguy
Twitter: @tlrx

--


(simonw-2) #2

Hey Tanguy,

First of all, index_options and term_vector are two totally different
things. index_options are "options" for the index you are searching on, a
data structure that maps "terms" to lists of documents (posting lists).
Term vectors are a data structure that gives you the "terms" for a given
document and, in addition, their positions in the document as well as their
start and end character offsets.

Now the index (each field has such an index) holds a sorted list of terms,
and each term points to a posting list. A posting list is the list of
documents that contain the term. On the posting list you can also store
information like frequencies (how often did term Y occur in document X ->
useful for scoring) as well as "positions" (at which position did term Y
occur in document X -> this is required for phrase & span queries).
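
To make the two structures concrete, here is a toy sketch in Python. It is
not how Lucene stores anything on disk; the sample documents, the whitespace
tokenizer and all function names are made up for illustration. It builds a
term -> posting list index for one field, projects it at the docs, freqs and
positions levels, and builds a per-document term vector with positions and
character offsets.

```python
from collections import defaultdict

# Hypothetical sample data: doc id -> content of a single text field.
docs = {
    1: "the quick brown fox",
    2: "the lazy dog and the quick cat",
}

def tokenize(text):
    # Assumption: a plain lowercase/whitespace analyzer.
    return text.lower().split()

def build_postings(docs):
    """Inverted index for one field: term -> {doc_id: [positions]}."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            postings[term].setdefault(doc_id, []).append(pos)
    return dict(sorted(postings.items()))  # terms are kept sorted

def view(postings, level):
    """Show what would be kept on the posting lists at each level."""
    if level == "docs":
        return {t: sorted(p) for t, p in postings.items()}
    if level == "freqs":
        return {t: {d: len(pos) for d, pos in p.items()}
                for t, p in postings.items()}
    if level == "positions":
        return postings
    raise ValueError(level)

def phrase_match(postings, first, second):
    """Doc ids where `second` directly follows `first` (needs positions)."""
    hits = []
    for doc_id, positions in postings.get(first, {}).items():
        following = set(postings.get(second, {}).get(doc_id, []))
        if any(p + 1 in following for p in positions):
            hits.append(doc_id)
    return sorted(hits)

def build_term_vector(text):
    """Per-document view: term -> [(position, start_offset, end_offset)]."""
    vector = defaultdict(list)
    lowered = text.lower()
    cursor = 0
    for pos, term in enumerate(tokenize(text)):
        start = lowered.index(term, cursor)
        end = start + len(term)
        vector[term].append((pos, start, end))
        cursor = end
    return dict(vector)

postings = build_postings(docs)
term_vector_doc2 = build_term_vector(docs[2])
```

With this data, view(postings, "docs")["the"] is [1, 2], the freqs view
adds that "the" occurs twice in document 2, and only the positions view
lets phrase_match answer a query like "quick brown". The term vector goes
the other way: it starts from one document and lists its terms.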

If, for instance, you have a field that you only use for filtering, you
don't need freqs and positions, so docs alone will do the job. In an index,
the position information is usually the biggest piece of data aside from
stored fields. If you don't do phrase or span queries you don't need
positions at all, so save the disk space and improve performance by
indexing only docs and freqs. In previous versions it wasn't possible to
have freqs without positions (index_options supersedes
omit_term_frequencies_and_positions), so this is an improvement overall,
since the most common use case probably only needs freqs but no positions.
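
As a rough sketch of how that could look in a mapping, the Python snippet
below builds a mapping body and prints it as JSON. The type name "article"
and the field names are hypothetical, and the syntax assumes the 0.20-era
"string" field type; check the mapping docs for your version before
relying on it.

```python
import json

# Hypothetical mapping: type and field names are invented for illustration.
mapping = {
    "article": {
        "properties": {
            # Only used for filtering: document ids are enough.
            "status": {
                "type": "string",
                "index": "not_analyzed",
                "index_options": "docs",
            },
            # Scored, but never searched with phrase or span queries.
            "tags": {
                "type": "string",
                "index_options": "freqs",
            },
            # Phrase queries need positions. term_vector is a separate,
            # per-document structure and is configured independently.
            "body": {
                "type": "string",
                "index_options": "positions",
                "term_vector": "with_positions_offsets",
            },
        }
    }
}

mapping_json = json.dumps(mapping, indent=2)
print(mapping_json)
```

The point of the sketch is only that index_options and term_vector are set
side by side on a field and do not replace each other.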

Hope this makes more sense to you now.

simon

(In reply to Tanguy's post of Thursday, October 25, 2012 9:10:09 AM UTC+2.)

--


(Tanguy) #3

Thanks Simon, it's clear to me now :o)

-- Tanguy

(In reply to simonw's post of Thursday, October 25, 2012 10:03:55 UTC+2.)

--

