Group_Concat functionality for SQL Server

An interesting request came in a couple weeks ago in querying data from a SQL Server 2012 database for a SSRS report.   How to roll up data to a single row, but, for a given column with a string data type, display all values available?     The data was available in a standard row orientation, and the challenge was to do some type of string aggregate to roll up the data.    The resulting research pitted two approaches against each other to address the requirement for a string aggregate or group concat.     

 

First off, here is the scenario that is looking to be addressed.    


Scenario
 
Source data
Patient                         Allergy1
Patient                         Allergy2
Patient                         Allergy3
Patient                         Allergy4
                   
Patient             Allergy n

 
Desired output
Patient             Allergy1, Allergy2, Allergy3, Allergy4…Allergy n

 
TSQL has to have some sort of function to do this, right?    The answer is no.    It doesn’t.   Other platforms, such as MySQL, has the functionality built in, but with TSQL in SQL Server, you have to code towards it.    

 
What I would have wanted to see in the way a TSQL Statement if one was available is the following:
 
Select Patient, group_concat(Allergy)
From patient_test
Group by Patient 

 

Options to perform a group concatenation type functionality in SQL Server: 


In surfing the web trying to see if someone had come up with a solution for performing a group concatenation / string aggregate for the same scenario I was looking at, I found two options that looked promising.     

      1)      TSQL leveraging a sub-select with XML
      2)      SQL CLR to perform GROUP_CONCAT string aggregate 
            - code is in codeplex at http://groupconcat.codeplex.com/
           - Aggregate functions available from CLR implementation
·         [dbo].[GROUP_CONCAT]
·         [dbo].[GROUP_CONCAT_D]
·         [dbo].[GROUP_CONCAT_DS]
·         [dbo].[GROUP_CONCAT_S]

 
The TSQL version has a unique use of the FOR XML PATH functionality that I would not have considered when looking at the scenario.    By specifying a zero-length string on the row element name, the wrapping element is not produced.   By doing so, TSQL was able to concatenate the values for each row and return in a single column (without the XML tags).  

 The SQL CLR version is a more eloquent version and has reusability for the source code, but requires deploying a SQL CLR function which may not be well received, or permitted, by a DBA staff, and in the second example, can run into issues with handing multiple tables.    



Before I go into two common solutions that are available, I have are general warning(s) about using a group_concat function or tsql equivalent.  


    1)      need to enforce distinct values returned
       - Duplicate values are a risk If not explicitly screened out. 
    2)      Sort order of values
       - Values returned can very easily be returned in the order that the data is stored and not necessarily the sort order expected for reporting.   
     3)      Cardinality of value set
       - The lower the cardinality (higher number of distinct values) will result in a column that one will have a hard time displaying in a report, or potentially cause a truncation error on insert to a table.    

 
 
 

Examples using the AdventureWorks database.

 

Source data that is available as a base for the examples: 

SELECT      CAT.Name AS [Category],
             SUB.Name AS [Sub Categories]
FROM  Production.ProductCategory CAT
inner join [Production].[ProductSubcategory] SUB
      on SUB.ProductCategoryID = CAT.ProductCategoryID
 
 

Scenario 1 - using two tables 

1) TSQL version  

Here we implement a solution using a sub-query and the FOR XML PATH functionality.    It is an interesting use of the FOR XML PATH predicate.    The empty string after the FOR XML PATH remove the row element from the results and allows us to be able to concatenate the subcategory name in a single column.   The STUFF function remove the leading comma from the result.
 
SELECT CAT.Name AS [Category],
        STUFF((    SELECT ',' + SUB.Name AS [text()]
                          FROM Production.ProductSubcategory SUB
                         WHERE SUB.ProductCategoryID = CAT.ProductCategoryID
                         FOR XML PATH('')
                     ), 1, 1, '' ) AS [Sub Categories]
FROM  Production.ProductCategory CAT
 
Results from query.
 
As you can see, the sub category values have been aggregated to a single row with distinct values in sort order.  
 
Execution Plan used in statement above:
 
 
  
 
 

2) SQL CLR  version

 
In using the solution from codeplex, we have a SQL CLR function that has been implemented to a database in SQL Server.   
 SELECT        CAT.Name AS [Category],
                    dbo.GROUP_CONCAT_S(SUB.Name,1) as [Sub Categories]  --,
FROM  Production.ProductCategory CAT
inner join Production.ProductSubcategory SUB
   on SUB.ProductCategoryID = CAT.ProductCategoryID
GROUP BY CAT.Name
 
 Results from query.
 
 
As you can see, the sub category values have been aggregated to a single row with distinct values in sort order, similar to the TSQL version.  
 
 
Execution Plan used in statement above:
 
 
  
So far, both options hold promise to address the scenario at hand.  
 
 
 
 

Scenario 2 - Dig deeper, performing a group concat with three tables

With all things being equal, I wanted to perform a slightly more complex query to see what the execution plan would be, but ended up discovering a variance in results.   
 
 

1) TSQL version  

SELECT CAT.Name AS [Category],
       STUFF(( SELECT ',' + SUB.Name AS [text()]
               FROM Production.ProductSubcategory SUB
               WHERE
                 SUB.ProductCategoryID = CAT.ProductCategoryID
               order by 1
               FOR XML PATH('')
             ), 1, 1, '' ) AS [Sub Categories],
       STUFF(( SELECT distinct ',' + RTRIM(product.Class) AS [text()]
               FROM Production.ProductSubcategory SUB
               inner join [Production].[Product] product
                  on product.ProductSubcategoryID = sub.ProductSubcategoryID
               WHERE
                  SUB.ProductCategoryID = CAT.ProductCategoryID
               order by 1
               FOR XML PATH('')
               ), 1, 1, '' ) AS [Class_available]
FROM Production.ProductCategory CAT

 
 Results from query.
 
 
 
As you can see, the sub category values have been aggregated to a single row with distinct values in sort order and thus, returning what was intended.  
 
 
Execution Plan used in statement above:
 
 
As you add each new sub-query, the impact to the execution plan is a new branch to the execution tree as you can see above.   
 
 
 

2) SQL CLR  version

 
Here we add a second GROUP_CONCAT function to the column list in the SELECT statement, and do not add another column to the group by as we have two aggregate functions in the SELECT.    
 
SELECT        CAT.Name AS [Category],
                      dbo.GROUP_CONCAT_S(SUB.Name,1) as [Sub Categories],
                      dbo.GROUP_CONCAT_DS(product.Class,',',1)    AS [Class_available]
FROM  Production.ProductCategory CAT
inner join Production.ProductSubcategory SUB
   on SUB.ProductCategoryID = CAT.ProductCategoryID
inner join [Production].[Product] product
            on product.ProductSubcategoryID = sub.ProductSubcategoryID
GROUP BY CAT.Name
 
 
 Results from query.
 
 
 
As you can see, the results are eschewed when adding another table / level to the query.  So close, but no cigar.   Further testing allowed me to get the proper classes available when adding the subcategory name to the group by clause, but in doing so, it lost the string aggregate values for Sub Categories.    
 
 
Execution Plan used in statement above:
 

 

Conclusion:

With any canned solution, first test it to make sure it is performing for your scenario at hand.    The possibility that how you are using it is different than what the solution was coded for.    Though the TSQL solution is customized and not reusable, and has more of an impact on the query plan, it is providing proper results for the requested query.