Finding count by distinct text row using Lens

Hello,

I am very new to this; up until 12 hours ago I hadn't even heard of Elasticsearch or Kibana.

I do, however, have basic knowledge of SQL and other data visualization tools (such as Data Studio).

I am using this with Twint (twintproject/twint on GitHub: an advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API) to analyze some data.

I want to find similar tweets based on text (and record the count of occurrences), but it looks like I can't use text-based fields in Lens.

Now, I can use text-based fields in Discover, but it doesn't aggregate them to give me a count of each.

Am I looking at this incorrectly?

Please assist if you can.

I'm not sure I'm fully understanding what you want to do.

I want to find similar tweets based on text

What's "similar" here? Do you mean exactly the same text? If that's the case you can use the "Top values" function for the tweet field together with the "Count" function in Lens.
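For reference, "Top values" plus "Count" in Lens corresponds roughly to a terms aggregation in Elasticsearch. Below is a minimal sketch of such a request body built in Python. The field name `tweet.keyword` is an assumption for illustration (check your own mapping), and `top_values_query` is a hypothetical helper, not part of any library.

```python
import json


def top_values_query(field, size=10):
    """Build a search body that counts documents per distinct value of `field`."""
    return {
        "size": 0,  # return no individual hits, only the aggregation buckets
        "aggs": {
            "top_values": {
                "terms": {"field": field, "size": size}
            }
        },
    }


# "tweet.keyword" is an assumed field name, not taken from the actual Twint mapping.
body = top_values_query("tweet.keyword", size=20)
print(json.dumps(body, indent=2))
```

Sent as the body of a `_search` request against the index, each returned bucket carries a `key` (the exact tweet text) and a `doc_count` (how many documents have it).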

Hi,

Yes, "exact text match". But the problem is that the tweet field is not visible in Lens.

It is, however, visible in Discover. Do I need to create some sort of index or something?

Adding more to it.

There are 37 fields available in Discover, but only 19 in Lens (see screenshots below).

Thanks,
Chetan

OK, so the difference between the Discover and Lens views is that Discover works on individual documents, while Lens works with aggregated data only.

Depending on how your index is configured, not all fields are aggregatable. For Discover this doesn't matter - it can still show all of them. But in Lens, the list of fields is pre-filtered.

If you go to Management > Index patterns you can see this:

agent.keyword is usable in Lens and will show up in the list, while agent is not. In Discover, you can use both fields.
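The same information can be read outside of Kibana from Elasticsearch's field capabilities API (`GET <index>/_field_caps?fields=*`), which reports an `aggregatable` flag per field. Here is a small sketch that filters such a response in Python. The `sample` dictionary is hand-written to resemble the response shape and is not real output; `aggregatable_fields` is a hypothetical helper.

```python
def aggregatable_fields(field_caps):
    """Return sorted names of fields that are aggregatable in every mapped type."""
    names = []
    for name, by_type in field_caps.get("fields", {}).items():
        # A field can appear under several types across indices; require all of them.
        if by_type and all(caps.get("aggregatable", False) for caps in by_type.values()):
            names.append(name)
    return sorted(names)


# Hand-written sample shaped like a _field_caps response (assumption, not real data).
sample = {
    "fields": {
        "agent": {
            "text": {"type": "text", "searchable": True, "aggregatable": False}
        },
        "agent.keyword": {
            "keyword": {"type": "keyword", "searchable": True, "aggregatable": True}
        },
    }
}

print(aggregatable_fields(sample))  # ['agent.keyword']
```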

If the tweet message field shows up in Discover but not in Lens it means it's not aggregatable. To fix this, you have to change your mapping and make sure it's indexed as "keyword" (not only text). Then you need to re-index your existing data so the aggregatable index is built within Elasticsearch.

A common mapping (like in the screenshot above) indexes the original field as "text" for full-text search and adds a second field, suffixed with ".keyword", that holds the keyword-indexed version of the same data for aggregations. This gives you the best of both worlds:

        "agent" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
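To make the difference between the two representations concrete, here is a toy illustration in Python. It is not how Elasticsearch is implemented: the lowercase-and-split step is only a rough stand-in for a text analyzer, and the sample tweets are made up.

```python
from collections import Counter

# Made-up sample data for illustration.
tweets = [
    "Hello world",
    "Hello world",
    "hello World again",
]

# Keyword-style: each whole string is a single term, so identical tweets group together.
keyword_counts = Counter(tweets)

# Text-style (rough stand-in for an analyzer): lowercase and split into words.
token_counts = Counter(word for t in tweets for word in t.lower().split())

print(keyword_counts.most_common(1))  # [('Hello world', 2)]
print(token_counts["hello"])          # 3
```

Counting whole strings is what an exact-match count needs, which is why the ".keyword" sub-field is the one to aggregate on. One thing to watch: with "ignore_above" set to 256, values longer than 256 characters are not indexed in the keyword sub-field, so very long values would be missing from the counts.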

Thanks @flash1293, this helps me understand the problem. Yes, the field is not aggregatable in the index.

I am trying to do the above based on the code you provided but that doesn't seem to be working. I am positive I am missing something here.

This tries to set the mapping for the tweet type, not the tweet field. I think the best approach is to create a completely new index (maybe twinttweets_fixed) with a mapping for all fields, including the keyword version of the tweet field, then use the Reindex API to copy the data from twinttweets to twinttweets_fixed.
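As a rough sketch of those two steps, the request bodies could be built like this in Python. It assumes Elasticsearch 7 or later (typeless mappings), shows only the tweet field (the real mapping would need the remaining fields as well), and raises "ignore_above" to 1024 on the assumption that some tweets are longer than 256 characters. `text_with_keyword` is a hypothetical helper.

```python
import json

SOURCE_INDEX = "twinttweets"      # existing index named in this thread
DEST_INDEX = "twinttweets_fixed"  # new index name suggested in this thread


def text_with_keyword(ignore_above=256):
    """Mapping for a text field with an aggregatable .keyword sub-field."""
    return {
        "type": "text",
        "fields": {
            "keyword": {"type": "keyword", "ignore_above": ignore_above}
        },
    }


# Body for: PUT /twinttweets_fixed
# Only "tweet" is listed here; add the other fields before using this.
create_index_body = {
    "mappings": {
        "properties": {
            # 1024 is an assumed limit so that long tweets are still indexed.
            "tweet": text_with_keyword(ignore_above=1024),
        }
    }
}

# Body for: POST /_reindex
reindex_body = {
    "source": {"index": SOURCE_INDEX},
    "dest": {"index": DEST_INDEX},
}

print(json.dumps(create_index_body, indent=2))
print(json.dumps(reindex_body, indent=2))
```

Both bodies can be pasted into Kibana Dev Tools after the matching request line. Once the reindex finishes, an index pattern pointing at twinttweets_fixed should list tweet.keyword in Lens.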

That's a lot of learning for me right there, I'll go about doing that. Thank you @flash1293