Questions TTL/GET/

ddorian43 · September 17, 2013, 5:05pm

How are Get and Multiget queries fetched? What data structure is used to
hold this data?

Are expired documents indexed and deleted by a cron-like job (like MongoDB), or
are they deleted when segments are merged (Cassandra/Bigtable)?

And what should I use if I just want to filter documents (no scoring):
filter, filtered query, or something else?

Thanks

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

polyfractal · September 17, 2013, 5:22pm

1. I don't fully understand your question. Elasticsearch has a binary
  serialization protocol: the document is retrieved from the shard,
  serialized across the network to the coordinating node, and then sent to
  the client as JSON.
2. The TTL cycle runs every 60s by default. Expired documents are
  marked as deleted and removed during merges. Lucene has no concept of
  random-access writes; segments are immutable. So the only way to delete a
  document is to mark it as deleted and remove it during a subsequent merge.
  This is also why TTL is generally not recommended for most scenarios: it
  is much slower than simply deleting a time-based index.
3. Use filtered_query. The top-level filter should only be used when
  working with facets and you require fine-grained filter control of search
  vs. facet results.
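As a rough illustration of the filtered query mentioned above, here is a minimal Python sketch that builds such a request body. It assumes the `filtered` query shape of the Elasticsearch releases current at the time of this thread; the helper name `filtered_query` and the field `status` with value `published` are made up for the example.

```python
import json


def filtered_query(filter_clause, query_clause=None):
    """Build a search body whose filter narrows results without scoring.

    Assumption: the `filtered` query shape from the Elasticsearch
    releases discussed in this thread (2013).
    """
    return {
        "query": {
            "filtered": {
                # match_all keeps every document; the filter does the narrowing
                "query": query_clause or {"match_all": {}},
                "filter": filter_clause,
            }
        }
    }


# "status" and "published" are hypothetical field and value names
body = filtered_query({"term": {"status": "published"}})
print(json.dumps(body, indent=2))
```

The resulting body would be sent to the `_search` endpoint of an index.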

-Zach


ddorian43 · September 20, 2013, 7:45pm

What I mean is: how is _uid indexed? In a reversed (inverted) index?

In MongoDB TTL, the timestamp is indexed and the db runs a query every x
seconds to find the expired documents and delete them.
In Cassandra, expired cells are only deleted on compactions. There is
no separate index that gets scanned, as there is in MongoDB.

Which TTL type does Elasticsearch support?

It looks like you can't run a range query on the _id field without indexing
it? (Whereas get and multiget don't require indexing.)

Also, when documents are filtered and no sorting is specified, are documents
sorted by _id?

Thanks


polyfractal · September 20, 2013, 8:17pm

Answers inline:

What i mean is how is _uid indexed? Reversed index?

_uids are indexed into a bloom filter. The process looks like this:

Find the appropriate shard via routing, forward request to a node
with the shard
Sequentially iterate over each segment in the shard
1. Check bloom filter if DocID exists in this segment. If false,
  move to next segment (bloom filters never return false negatives)
2. If yes, perform search through segment, as blooms have some amount
  of false positives
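The per-segment check described above can be sketched in Python roughly as follows. This is a toy model, not Elasticsearch or Lucene code: the `BloomFilter` and `Segment` classes, the hash scheme, and the `tweet#1`-style uids are all assumptions made for illustration, and a plain dict stands in for the real search through a segment.

```python
import hashlib


class BloomFilter:
    """Tiny bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class Segment:
    """Immutable batch of documents with a bloom filter over their uids."""

    def __init__(self, docs):
        self.docs = dict(docs)
        self.bloom = BloomFilter()
        for uid in self.docs:
            self.bloom.add(uid)


def get(segments, uid):
    """Look up one document by uid, segment by segment."""
    for segment in segments:
        if not segment.bloom.might_contain(uid):
            continue  # definitely not here: skip to the next segment
        doc = segment.docs.get(uid)  # stand-in for the real segment search
        if doc is not None:
            return doc  # a miss here means the bloom gave a false positive
    return None


segments = [
    Segment({"tweet#1": {"text": "first"}}),
    Segment({"tweet#2": {"text": "second"}, "tweet#3": {"text": "third"}}),
]
print(get(segments, "tweet#2"))
print(get(segments, "tweet#999"))
```

The point of the filter is that most segments can be ruled out with a few bit checks, so the more expensive lookup only runs where the document probably lives.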

In mongodb TTL, the timestamp is indexed and the db runs a query every x
seconds to find the expired documents and delete them.
While in Cassandra, expired cells are only deleted on compactions. There
is not another index that scans like mongodb.

Which TTL type does elasticsearch support?

See my answer in the first email. Elasticsearch's background process runs
every 60s, compares timestamps and then marks documents as deleted. The
physical deletion of data doesn't happen until a merge later removes it.
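To make the two phases concrete (a periodic sweep that only marks documents, and a later merge that actually drops them), here is a small Python simulation. The `Doc` and `Segment` classes and the `ttl_sweep` and `merge` functions are invented for this sketch and do not correspond to Elasticsearch internals.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    uid: str
    expires_at: float  # absolute expiry time in seconds
    deleted: bool = False


@dataclass
class Segment:
    docs: list = field(default_factory=list)


def ttl_sweep(segments, now):
    """Periodic pass: mark expired docs as deleted, but keep them stored."""
    marked = 0
    for segment in segments:
        for doc in segment.docs:
            if not doc.deleted and doc.expires_at <= now:
                doc.deleted = True
                marked += 1
    return marked


def merge(segments):
    """Write one new segment holding only the live docs of the inputs."""
    live = [d for segment in segments for d in segment.docs if not d.deleted]
    return Segment(live)


segments = [
    Segment([Doc("a", 10.0), Doc("b", 100.0)]),
    Segment([Doc("c", 5.0)]),
]
marked = ttl_sweep(segments, now=60.0)
stored_before_merge = sum(len(s.docs) for s in segments)
merged = merge(segments)
print(marked, stored_before_merge, [d.uid for d in merged.docs])
```

After the sweep the expired documents are flagged but still occupy space; only the merge produces a segment without them.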

It looks like you can't range_query on _id field without indexing ? (like
get and multiget doesn't require indexing)

Correct. Queries only operate on inverted index data, so they require the
_id to be indexed along with the other fields. Get/Multiget use a
different mechanism (detailed above w/ bloom filter), they aren't queries,
so they don't require an inverted index.

Also when documents are filtered and no sorting is specified, are documents
sorted by _id ?

If docs are filtered and have no score, the score is automatically set to 1.
Since all documents have the same score, the resulting order is
effectively random. In actuality it is the order of the documents in the
segments, but the merge process tends to shuffle them enough that you can
consider it random.
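A tiny Python sketch of why the result order is unrelated to _id when every hit has the same score. It assumes ties are broken by the document's internal position in the segments; the `doc_pos` field and the sample ids are hypothetical.

```python
# Each hit carries the same constant score; doc_pos is a hypothetical
# stand-in for the document's internal position in the segments.
hits = [
    {"_id": "m5", "doc_pos": 2, "_score": 1.0},
    {"_id": "z9", "doc_pos": 0, "_score": 1.0},
    {"_id": "a1", "doc_pos": 1, "_score": 1.0},
]

# Highest score first; equal scores fall back to internal position
ranked = sorted(hits, key=lambda h: (-h["_score"], h["doc_pos"]))
ranked_ids = [h["_id"] for h in ranked]
print(ranked_ids)
```

If a stable order matters, specifying an explicit sort is the safer choice, since internal positions change as segments merge.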

Hope that helps!
-Zach


ddorian43 · October 15, 2013, 3:00pm

Everything helps.

1. If I do a filter + sort, must only the filtered documents' sort values
  reside in memory, or those of all documents? (I'm guessing all.)
2. Is the doc_type saved as a string or as some kind of internal id in the
  _uid field (so if I have many small documents I should keep it short)?
3. If I disable the "_all" field (because I don't want to duplicate
  indexes), how can I simulate it (by searching on all the indexed fields?)?
4. If I want to use autocomplete on a tags field, should I use ngrams or the
  suggester (when the user starts typing, show a list of tags that match the
  typed characters; looks like a job for the suggester)?
5. It looks like the suggester doesn't return documents?
6. If I use a field.type(number) and I won't do range queries (only
  equality), what precision_step should I use for this field (0 or 1?)?

