Hello,
I want to run an analysis on Elasticsearch that retrieves the content of its inverted indexes.
I need this to get the list of document IDs associated with each field/value combination stored in the inverted index.
For instance, given these 4 documents in my cluster:
{'_id': 1, 'os': 'linux', 'lang': 'python'}
{'_id': 2, 'os': 'linux', 'lang': 'perl'}
{'_id': 3, 'os': 'mac', 'lang': 'ruby'}
{'_id': 4, 'os': 'bsd', 'lang': 'python'}
I want to return the following results, where '_ids' contains the list of document IDs:
{'os': 'linux', '_ids': [1, 2]}
{'os': 'mac', '_ids': [3]}
{'os': 'bsd', '_ids': [4]}
{'lang': 'python', '_ids': [1, 4]}
{'lang': 'ruby', '_ids': [3]}
{'lang': 'perl', '_ids': [2]}
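To make the expected output concrete, here is a minimal plain-Python sketch of the mapping I am after, applied to the four sample documents above. It does not touch Elasticsearch at all; the function name `invert` is just something I made up for illustration, and the order of the resulting rows is not significant.

```python
from collections import defaultdict

def invert(docs):
    """Map each (field, value) pair to the list of document IDs containing it."""
    postings = defaultdict(list)
    for doc in docs:
        for field, value in doc.items():
            if field == "_id":
                continue  # the ID is the payload, not an indexed field
            postings[(field, value)].append(doc["_id"])
    # One row per field/value pair, in the same shape as the rows listed above
    return [{field: value, "_ids": ids} for (field, value), ids in postings.items()]

docs = [
    {"_id": 1, "os": "linux", "lang": "python"},
    {"_id": 2, "os": "linux", "lang": "perl"},
    {"_id": 3, "os": "mac", "lang": "ruby"},
    {"_id": 4, "os": "bsd", "lang": "python"},
]

result = invert(docs)
for row in result:
    print(row)
```

This is trivial in memory, but of course it does not scale to the real index, which is why I am looking for a way to get the same rows out of Elasticsearch itself.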
I tested the Composite Aggregation API, but I was only able to return the document count, not the full list of document IDs:
{'os': 'linux', 'docs': 2}
{'os': 'mac', 'docs': 1}
{'os': 'bsd', 'docs': 1}
{'lang': 'python', 'docs': 2}
{'lang': 'ruby', 'docs': 1}
{'lang': 'perl', 'docs': 1}
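For reference, a composite aggregation request of this kind looks roughly like the sketch below, written as plain Python dictionaries. The aggregation name `by_value`, the page size, and the assumption that `os` and `lang` are mapped as keyword fields are all illustrative. The sample response is hand-written in the shape Elasticsearch returns for composite buckets (`key` plus `doc_count`); it is not real output from my cluster.

```python
def composite_request(field, page_size=1000, after_key=None):
    """Build a composite aggregation body that pages over the distinct values of `field`."""
    composite = {
        "size": page_size,
        "sources": [{field: {"terms": {"field": field}}}],
    }
    if after_key is not None:
        # Resume from the `after_key` returned with the previous page
        composite["after"] = after_key
    # "size": 0 suppresses the search hits; only the aggregation is returned
    return {"size": 0, "aggs": {"by_value": {"composite": composite}}}

def rows_from_response(response):
    """Flatten composite buckets into rows like {'os': 'linux', 'docs': 2}."""
    buckets = response["aggregations"]["by_value"]["buckets"]
    return [{**bucket["key"], "docs": bucket["doc_count"]} for bucket in buckets]

# Hand-written mock of a response for the `os` field (hypothetical, for illustration)
sample_response = {
    "aggregations": {
        "by_value": {
            "after_key": {"os": "mac"},
            "buckets": [
                {"key": {"os": "bsd"}, "doc_count": 1},
                {"key": {"os": "linux"}, "doc_count": 2},
                {"key": {"os": "mac"}, "doc_count": 1},
            ],
        }
    }
}

body = composite_request("os")
rows = rows_from_response(sample_response)
print(rows)
```

Each bucket only carries the key and a `doc_count`, which is why I end up with counts rather than the IDs themselves.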
At this point, my options are either to use Elasticsearch Hadoop or to migrate my data to Hadoop directly (the index contains more than 6 million documents and takes up 1.7 TB). I could also run one query per field/value pair, but that would be really inefficient (in the example above, that would be 6 queries for 4 documents).
Do you know of an Elasticsearch API I can use to extract this information?
Is there a more lightweight alternative to Hadoop for this case?
Thank you for your help.