Go Back   CORTEX Forums > Best Practices > Tips and Techniques > Data Warehousing Tips and Techniques > DWH Tip Feeds
Register Blogs FAQ Members List Calendar Search Today's Posts Mark Forums Read

Greatest N per group: top 3 with GROUP_CONCAT()

This is a discussion on Greatest N per group: top 3 with GROUP_CONCAT() within the DWH Tip Feeds forums, part of the Data Warehousing Tips and Techniques category; In my opinion, one of the best things that happened to Planet MySQL lately, is Explain Extended , a blog by Alex Bolenok (also known as Quassnoi on Stackoverflow ). ...


Reply
 
LinkBack Thread Tools Search this Thread Display Modes
Old 19th March 2010, 03:56 PM   #1
News Bot
 
Join Date: Nov 2007
Posts: 15,067
Latest News Headlines is on a distinguished road
Post Greatest N per group: top 3 with GROUP_CONCAT()

In my opinion, one of the best things that happened to Planet MySQL lately, is Explain Extended, a blog by Alex Bolenok (also known as Quassnoi on Stackoverflow).

I never had the pleasure of meeting Alex in person, but his articles are always interesting and of high quality, and the SQL wizardry he pulls off is downright inspiring. I really feel humbled by the creativity of some of his solutions and his apparent experience with multiple RDBMS products.

Alex' latest point is about aggregation, and finding a top 3 based on the aggregate:

In MySQL I have a table called meanings with three columns: word, meaning, person. word has 16 possible values, meaning has 26. A person assigns one or more meanings to each word. In the sample above, person 1 assigned two meanings to word 2. There will be thousands of persons. I need to find the top three meanings for each of the 16 words, with their frequencies. Is it possible to solve this with a single MySQL query?
Alex presents a solution that uses GROUP_CONCAT basically as a poor man's windowing function, a technique I have described on several occasions in the past for ranking, median and percentile solutions in MySQL.

Now, Alex' solution is very clever and there are some elements that I think are very creative. That said, I think his solution can be improved still. Normally I wouldn't write a blog about it, and simply leave a comment on his blog, but his blog supports comments only for general articles, which is why I present it here:
SELECT word
, CONCAT(
SUBSTRING_INDEX(
GROUP_CONCAT(meaning ORDER BY num DESC), ',', 1
)
, ' ('
, SUBSTRING_INDEX(
GROUP_CONCAT(num ORDER BY num DESC), ',', 1
) / SUM(num) * 100
, '%)'
) rank1
, CONCAT(
SUBSTRING_INDEX(
SUBSTRING_INDEX(
GROUP_CONCAT(meaning ORDER BY num DESC), ',', 2
), ',', -1
)
, ' ('
, SUBSTRING_INDEX(
SUBSTRING_INDEX(
GROUP_CONCAT(num ORDER BY num DESC), ',', 2
), ',', -1
) / SUM(num) * 100
, '%)'
) rank2
, CONCAT(
SUBSTRING_INDEX(
SUBSTRING_INDEX(
GROUP_CONCAT(meaning ORDER BY num DESC), ',', 3
), ',', -1)
, ' ('
, SUBSTRING_INDEX(
SUBSTRING_INDEX(
GROUP_CONCAT(num ORDER BY num DESC), ',', 3
), ',', -1
) / SUM(num) * 100
, '%)'
) rank3
FROM (
SELECT word, meaning, COUNT(*) num
FROM t_meaning m
GROUP BY word,meaning
) a
GROUP BY word

This gives me output like this:
+------+--------------+--------------+--------------+
| word | rank1 | rank2 | rank3 |
+------+--------------+--------------+--------------+
| 1 | 16 (3.9728%) | 17 (3.9648%) | 12 (3.9632%) |
| 2 | 9 (3.9792%) | 10 (3.9632%) | 20 (3.9328%) |
| 3 | 20 (3.9744%) | 13 (3.968%) | 1 (3.9648%) |
| 4 | 26 (3.952%) | 7 (3.9456%) | 17 (3.9424%) |
| 5 | 9 (4.008%) | 21 (3.9824%) | 20 (3.936%) |
| 6 | 19 (3.9504%) | 10 (3.9488%) | 13 (3.9408%) |
| 7 | 23 (4.0464%) | 12 (3.976%) | 19 (3.9648%) |
| 8 | 23 (4.0112%) | 3 (4.0096%) | 8 (3.9328%) |
| 9 | 10 (4.016%) | 19 (3.984%) | 15 (3.9616%) |
| 10 | 10 (4.0304%) | 14 (3.9344%) | 11 (3.9312%) |
| 11 | 16 (3.9584%) | 6 (3.9296%) | 19 (3.9232%) |
| 12 | 7 (3.9968%) | 1 (3.9392%) | 26 (3.9264%) |
| 13 | 8 (4.048%) | 25 (3.9712%) | 23 (3.9616%) |
| 14 | 16 (3.9936%) | 26 (3.9632%) | 4 (3.9536%) |
| 15 | 22 (4.0608%) | 12 (4.0048%) | 1 (3.9632%) |
| 16 | 14 (4.0032%) | 18 (3.9712%) | 4 (3.9488%) |
+------+--------------+--------------+--------------+
16 rows in set (0.63 sec)
On my laptop, my solution is about 30% faster than the one presented by Alex. Personally I think mine is easier to understand too, but that is a matter of taste.

Anyway, I'm just posting this to share my solution - I do not intend to downplay the one presented by Alex. Instead, I invite everyone interested in SQL, MySQL and PostgreSQL to keep an eye on Alex' blog as well as his excellent answers on Stackoverflow. He's an SQL jedi master in my book :)

Of course, if you have a better solution to crack this problem in MySQL, please leave a comment. I'd love to hear what other people are doing to cope with these kinds of queries.


More from Roland Bouman's Blog ...
Latest News Headlines is offline  
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiTweet this Post!
Reply With Quote
Reply

Bookmarks

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is On
Trackbacks are On
Pingbacks are On
Refbacks are On


Similar Threads
Thread Thread Starter Forum Replies Last Post
Group Operations Manager admin 2010 Job Archive 0 12th March 2010 10:59 PM
HR / Practice Group Co-ordinator admin 2010 Job Archive 0 5th March 2010 02:01 AM
ISS Group fgeorge Local Vendors and Service Providers 0 28th November 2009 09:54 AM
Smarts Group admin Local Vendors and Service Providers 0 20th November 2009 07:10 AM
New Vendor Discussion Group admin Prediction Markets News Feeds 0 19th November 2009 01:30 PM


All times are GMT +11. The time now is 07:09 PM.

© The Business Intelligence Group

Search Engine Optimization by vBSEO