However, when I try to aggregate with count and group by year and month, strange results appear:
POST _xpack/sql?format=txt
{
"query": "select year(startTime), month(startTime), count(orderId) from worktop_process where isInternalOrder = true and externalOrderId like 'HM%' group by year(startTime), month(startTime)"
}
year(startTime)|month(startTime)|count(orderId)
---------------+----------------+---------------
2017 |12 |4
2018 |1 |15
2018 |2 |5
2018 |3 |39
2018 |4 |45
2018 |5 |45
2018 |6 |33
2018 |12 |8
The year value is not correct! Mostly it is offset by 1 year. There was no order in the year 2017, for instance. In the raw data there are 12 orders in the month 2018/12, but in the group by aggregate those 12 orders were split between 2017/12 (4 orders) and 2018/12 (8 orders).
I was totally baffled by the results and cannot figure out what is wrong. Does anyone have a clue? BTW, I am using version 7.1.1.
select year(startTime), month(startTime), count(orderId) from worktop_process where isInternalOrder = true and externalOrderId like 'HM%' group by year(startTime)
It's basically your query without the second group by column. Thanks.
Did you mean removing the second group by column? That would not work:
POST _sql?format=txt
{
"query": "select year(startTime), month(startTime), count(orderId) from worktop_process where isInternalOrder = true and externalOrderId like 'HM%' group by year(startTime)"
}
{
"error": {
"root_cause": [
{
"type": "verification_exception",
"reason": "Found 1 problem(s)\nline 1:25: Cannot use non-grouped column [startTime], expected [year(startTime)]"
}
],
"type": "verification_exception",
"reason": "Found 1 problem(s)\nline 1:25: Cannot use non-grouped column [startTime], expected [year(startTime)]"
},
"status": 400
}
The query below further illustrates that the year function is not working properly with group by:
POST _sql/?format=txt
{
"query": "select year(startTime), count(orderId) from worktop_process where startTime > '2018-01-01' group by year(startTime)"
}
year(startTime)|count(orderId)
---------------+---------------
2017 |1660
2018 |5908
First of all, there is no document in the index with startTime < 2018.
POST _sql/?format=txt
{
"query": "select min(startTime) from worktop_process"
}
min(startTime)
------------------------
2018-03-30T16:21:56.055Z
Second, even if there had been any, they shouldn't appear in the result due to the where clause.
@Yuan_Hong I agree that it's probably confusing, but as is, it works as expected.
If you use the _sql/translate API to find out what query we actually run to get those results, it will probably become clearer.
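For example, taking one of the queries from above (the translated Query DSL in the response should include the date_histogram grouping with the 31536000000ms interval discussed below, which is where the surprise comes from):
POST _sql/translate
{
"query": "select year(startTime), count(orderId) from worktop_process where startTime > '2018-01-01' group by year(startTime)"
}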
Basically, SQL is using a date_histogram aggregation with a 31536000000ms (365 days) interval to represent a year. The way date_histogram works with this interval type is to start from January 1st, 1970 (the epoch) and create buckets of 31536000000ms each. Counting this way from 1970-01-01, the 48th bucket starts on December 20th, 2017 and ends on the same day in 2018. All the dates that fall into this bucket will show the year 2017 in the results, even though they are all in 2018.
You may wonder now why that specific bucket starts on December 20th and not on December 31st or January 1st. The answer is in the interval itself: 31536000000ms represents exactly 365 days, with no notion of leap years (some years have 365 days and some have 366). Counting buckets from January 1st, 1970 until today will, every 4 years, add one day to the calculation and shift the bucket edges further and further.
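A quick back-of-the-envelope check of those bucket edges (the example order dates are illustrative):
days from 1970-01-01 to 2018-01-01 = 48 * 365 + 12 leap days = 17532
bucket size = 365 days (31536000000 ms)
bucket 48 starts at 48 * 365 = 17520 days after the epoch = 2017-12-20
bucket 48 ends at 49 * 365 = 17885 days after the epoch = 2018-12-20
So an order dated, say, 2018-12-15 lands in the bucket keyed 2017-12-20 and is reported under year 2017, while one dated 2018-12-25 lands in the next bucket and is reported under 2018 - which matches the 4/8 split of the 12 December orders from the first post.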
Of course, the intuitive solution here would be to use calendar years as the interval, but for consistency reasons we chose to do it like this, with a millisecond representation of a year. In SQL there is the notion of INTERVAL, where one can use INTERVAL 9 MONTH to represent a period of time that is not a "full" one (12 MONTH would be a full period, for example). With these interval types, one would use something like SELECT HISTOGRAM(my_date, INTERVAL 9 MONTH) AS int FROM test GROUP BY int to build time buckets of 9 months.
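Adapted to the index from this thread (swapping startTime and worktop_process in for the my_date and test placeholders; untested here, just to show the shape of the request):
POST _sql?format=txt
{
"query": "SELECT HISTOGRAM(startTime, INTERVAL 9 MONTH) AS int, COUNT(orderId) FROM worktop_process GROUP BY int"
}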
For what it's worth, there is an issue open - https://github.com/elastic/elasticsearch/issues/40162 - to address this discrepancy between the expectation and how it actually works. Maybe in the future this behavior will change, but for now it will stay as is.
The logic behind the behaviour is well explained and understood. However, the result is far from what a user with a typical RDBMS background would expect.
At least I would suggest pointing this peculiarity out in the documentation of the SQL date functions, so users are not surprised.