Hi all,
I'm investigating possible strategies for the following situation:
I have data in a graph where the nodes and edges represent knowledge about
certain topics. These topics may occur in unstructured text. The knowledge
about these topics is used in an analysis process to make sense of
unstructured text. The analysis results are indexed in ElasticSearch. The
graph is stored simply in MySQL for now. It's not really large (about 4000
nodes and 4000 edges/relationships), but the expectation is that this will
grow substantially.
The most important part of the analysis process involves identifying how
well topics are represented in the unstructured text. This is done based on
a number of rules which are represented in the knowledge graph. The
analysis results of a single piece of unstructured text consist of a list
of identified topics as well as a number of characteristics per topic. A
topic is considered better represented the more rules from the knowledge
graph it matches. I.e. a piece of text counts as representing a topic if it
meets a single rule, but if a second piece of text represents the same topic
by meeting 10 rules, the second document should score better in search
results.
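As a rough illustration of the rule-count idea (a simplified sketch, not the actual analysis process; the topic name, rule ids and the `count_rule_hits` helper are made up for the example):

```python
from collections import Counter


def count_rule_hits(matches):
    """Count how many distinct rules matched for each topic.

    `matches` is an iterable of (topic, rule_id) pairs, a hypothetical
    shape for the output of the analysis step.
    """
    distinct = set(matches)  # a rule that fires twice still counts once
    return Counter(topic for topic, _rule in distinct)


# Hypothetical example: document A meets one rule for the topic,
# document B meets ten rules for the same topic.
doc_a = [("solar", "r1")]
doc_b = [("solar", f"r{i}") for i in range(1, 11)]

hits_a = count_rule_hits(doc_a)
hits_b = count_rule_hits(doc_b)

# Document B should rank above document A for this topic.
ranked = sorted(
    [("A", hits_a["solar"]), ("B", hits_b["solar"])],
    key=lambda pair: pair[1],
    reverse=True,
)
```

The per-topic count is the kind of characteristic that would be stored with each document and used later for scoring.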
Searching the analysis results through ElasticSearch is performed using a
combination of filters and queries. Score is calculated using a function
score query. The script score part of this uses document fields (the
characteristics for each topic) as well as a number of parameters in the
formula.
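To give an idea of the shape of such a request, here is a minimal sketch that builds a function score query body as a Python dict. The field names `topic_ids` and `rule_hits`, the `weight` parameter and the formula are assumptions for illustration only, and the exact script syntax depends on the ElasticSearch version (the layout below follows recent versions):

```python
import json


def build_search_body(topic_ids, weight=1.0):
    """Build a function_score request body (hypothetical field names)."""
    return {
        "query": {
            "function_score": {
                # Filter: only documents tagged with one of the topics.
                "query": {
                    "bool": {
                        "filter": [{"terms": {"topic_ids": topic_ids}}]
                    }
                },
                # Score: a stored characteristic combined with a parameter.
                "script_score": {
                    "script": {
                        "source": "doc['rule_hits'].value * params.weight",
                        "params": {"weight": weight},
                    }
                },
                # Use the script result as the final score.
                "boost_mode": "replace",
            }
        }
    }


body = build_search_body(["topic-1", "topic-2"], weight=2.0)
payload = json.dumps(body)  # what would be sent to the _search endpoint
```

The real formula uses several characteristics per topic; a single field is used here only to keep the example short.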
When I search for the data, the query contains a number of topics I wish to
search for (let's say 40 topics) and finds documents that match best. I am
getting the right results when I search the data, which is great.
The only issue I have is the following: The knowledge in the graph is
updated regularly. Updates to the graph are required to be reflected in the
scoring of documents in the ElasticSearch index, leading to better search
results.
There are different strategies to have the changes to the graph reflected
in the scoring by ElasticSearch:
- Periodically re-analyse all pieces of unstructured text and index the
  results in ElasticSearch again. A lot of precalculations are performed
  and stored in the ElasticSearch index. An index alias could be used to
  switch between a "live" and a "rebuilding" index. The benefit is that
  this is easy to implement and the queries are really fast (<50ms), as
  much is precalculated. The drawback is that changes in the graph are
  only reflected in the ElasticSearch search scoring after a period of
  time (in my case about 8 hours), as the analysis process takes long to
  perform.
- Move parts of the analysis process to query-execution time by
  dynamically building a filter+query using the knowledge graph to
  identify the topics and calculate the characteristics where possible on
  the fly using function score queries with script scores. The benefit is
  that changes in the graph do not always require periodic updates to the
  entire index. The drawback is that if a graph section used to build the
  query has lots of related nodes, the resulting query DSL becomes huge
  and has lots of bool clauses. This adds overhead to programmatically
  construct the query and send it to ElasticSearch, and ElasticSearch also
  takes longer to perform the query (800 milliseconds). Going this route I
  have queries which are about 2 megabytes and contain 4000+ boolean
  clauses.
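The two strategies above can be sketched roughly as follows. The first helper builds the body for the `_aliases` endpoint that moves an alias from the old index to the rebuilt one in a single request. The second expands the requested topics into one term clause per related graph node, which is where the clause count grows. Index names, the alias name, the `topic_ids` field and the adjacency-dict graph are all hypothetical:

```python
def alias_swap_actions(alias, old_index, new_index):
    """Body for POST /_aliases: repoint `alias` in one atomic request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


def expand_topic_clauses(graph, topic_ids, field="topic_ids"):
    """One term clause per topic and per directly related node.

    `graph` maps a node id to the ids of its related nodes
    (a hypothetical in-memory view of the knowledge graph).
    """
    seen = []
    for topic in topic_ids:
        for node in [topic, *graph.get(topic, [])]:
            if node not in seen:  # avoid duplicate clauses
                seen.append(node)
    return [{"term": {field: node}} for node in seen]


# Strategy 1: switch the "live" alias once the rebuild has finished.
swap = alias_swap_actions("docs_live", "docs_v1", "docs_v2")

# Strategy 2: the query grows with the number of related nodes.
graph = {"a": ["b", "c"], "b": ["c"]}
clauses = expand_topic_clauses(graph, ["a", "b"])
query = {"bool": {"should": clauses}}
```

With 40 topics and many related nodes per topic, the `should` list in the second helper is what ends up holding thousands of clauses.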
Ideally, changes to the graph would be reflected in ElasticSearch as soon
as possible; within a couple of seconds is fine.
I am wondering if there are other strategies possible. I hope the above
clarifies my challenges enough for you to answer, but ask away if you have
questions. I just can't go into too much detail because of non-disclosure.
I'm open to using other technologies besides ElasticSearch, as well as
ElasticSearch plugins.
Kind regards,
Eric