Elasticsearch record aggregation in Power BI

Hi there!

I'm hoping someone here will have encountered a similar problem. I'm trying to query the Elasticsearch API to pull data into Power BI for visualization. What I quickly realized is that with larger datasets (>100k records), the Elasticsearch API return limit of 10k records became a barrier. The obvious solution was to implement a scrolling query. Unfortunately, Power BI did not understand how to handle this, and it would overwrite the records pulled. To clarify, if I pulled records in 5k chunks, it would see all 100k records, but only store the last 5k seen.

Someone on the Power BI forums solved this with a recursive script that, to the best of my understanding, only writes the records once a counter limit has been reached, or once all matching records have been returned. Recursive script here.

I had to modify this slightly, because my scroll_id was too large to be passed as a header or in the URL, but rather needed to be passed in the body. That brings us to the script shared below. It works, with a limit. The first time Elasticsearch is queried and the scroll session is established, I get a unique set of records. All subsequent queries (10, in this case), return a second set of 10k records.

Does anyone have suggestions as to why this is happening? I'm inclined to think this is on the side of how I'm interacting with the Elasticsearch API, but I'm open to other ideas.

Please let me know if you have any clarifying questions!

## example values for parameters passed to the function on initial call
url = http://localhost:9200/example_index/_search?scroll=1m
scrollURL = http://localhost:9200/_search/scroll
urlBody = {"size":10,"query":{"match_all":{}}}
scrollBody = {"scroll":"10m","scroll_id":"
scrollID = [null]
counter = [null]
scrollEND = "}

Script to recursively query elasticsearch through use of a scrolling query

let
	## Defining function
    RecursiveElasticFetchJSON = (url, scrollURL, urlBody, scrollBody, scrollID, counter, scrollEND) =>

    let
        Counter = if (counter = null) then 0 else counter,

        ## If function is called w/o a scrollID, then make the initial query, otherwise, query with a scrolling query
        Results = if (scrollID = null) then
            Json.Document(Web.Contents(url,[Content = Text.ToBinary(urlBody)]))
        else
            Json.Document(Web.Contents(scrollURL,[Content = Text.ToBinary(scrollBody&scrollID&scrollEND)])),

        ## Parsing results for later processing
        ParsedResults = Table.FromList(Results[hits][hits], Splitter.SplitByNothing(), null, null, ExtraValues.Error),

        ## Aggregates results and calls the function again, or returns the results
        Return = if (Counter < 10) then
            ParsedResults & RecursiveElasticFetchJSON(url, scrollURL, urlBody, scrollBody, Results[_scroll_id], Counter+1, scrollEND)
        else
            ParsedResults
    in
        Return
in
    RecursiveElasticFetchJSON

Ya I have this problem too.. HELP!!!!

Replying to close this out for anyone else who might encounter this in the future, Power BI caches the results from repeat calls to the same API. To work around this problem, you can add a unique value to the header each time you call out. Using the counter value already specified in the script is a good way to achieve this.

Credit to KenvM over on the Power BI forums for providing the solution.

Where to add the header value:
Json.Document(Web.Contents(scrollURL,[Content ==Text.ToBinary(scrollBody&scrollID&scrollEND), **Headers=[MyHeader=Text.From(Counter)]**])),

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.