An interesting request came
in a couple weeks ago in querying data from a SQL Server 2012 database for a SSRS
report. How to roll up data to a
single row, but, for a given column with a string data type, display all values available? The
data was available in a standard row orientation, and the challenge was to do
some type of string aggregate to roll up the data. The resulting research pitted two approaches
against each other to address the requirement for a string aggregate or group concat.
First off, here is the
scenario that is looking to be addressed.
Scenario
Source data
Patient Allergy1
Patient Allergy2
Patient Allergy3
Patient Allergy4
… …
Patient Allergy n
Desired output
Patient Allergy1, Allergy2, Allergy3,
Allergy4…Allergy n
TSQL has to have some sort of
function to do this, right? The answer
is no. It doesn’t. Other
platforms, such as MySQL, has the functionality built in, but with TSQL in SQL
Server, you have to code towards it.
What I would have wanted to
see in the way a TSQL Statement if one was available is the following:
Select Patient,
group_concat(Allergy)
From patient_test
Group by Patient
Options to perform a
group concatenation type functionality in SQL Server:
In surfing the web trying to see if someone had come up with a solution for performing a group concatenation / string aggregate for the same scenario I was looking at, I found two options that looked promising.
1)
TSQL leveraging a
sub-select with XML
2)
SQL CLR to
perform GROUP_CONCAT string aggregate
- code is in codeplex at http://groupconcat.codeplex.com/
- Aggregate functions available from CLR implementation
·
[dbo].[GROUP_CONCAT]
·
[dbo].[GROUP_CONCAT_D]
·
[dbo].[GROUP_CONCAT_DS]
·
[dbo].[GROUP_CONCAT_S]
The TSQL version has a unique
use of the FOR XML PATH functionality that I would not have considered when
looking at the scenario. By specifying a zero-length
string on the row element name, the wrapping element is not produced. By doing so, TSQL was able to concatenate the values for each row and return in a single column (without the XML tags).
The SQL CLR version is a more
eloquent version and has reusability for the source code, but requires
deploying a SQL CLR function which may not be well received, or permitted, by a
DBA staff, and in the second example, can run into issues with handing multiple
tables.
Before I go into two common solutions that are available, I have are general warning(s) about using a group_concat function or tsql equivalent.
1) need to enforce distinct values returned
- Duplicate values are a risk If not explicitly screened out.
2) Sort order of values
- Values returned can very easily be returned in the order that the data is stored and not necessarily the sort order expected for reporting.
3) Cardinality of value set
- The lower the cardinality (higher number of distinct values) will result in a column that one will have a hard time displaying in a report, or potentially cause a truncation error on insert to a table.
Examples using the AdventureWorks
database.
Source data that is available as a base for the examples:
SELECT CAT.Name AS [Category],
SUB.Name AS [Sub Categories]
FROM
Production.ProductCategory CAT
inner join
[Production].[ProductSubcategory] SUB
on SUB.ProductCategoryID =
CAT.ProductCategoryID
Scenario 1 - using two tables
1) TSQL version
Here we implement a solution using a sub-query
and the FOR XML PATH functionality. It is an interesting use of the FOR XML PATH predicate. The empty string after the FOR XML PATH remove the row element from the results and allows us to be able to concatenate the subcategory name in a single column. The STUFF function remove the leading comma from the result.
SELECT CAT.Name AS [Category],
STUFF(( SELECT ',' + SUB.Name AS [text()]
FROM
Production.ProductSubcategory SUB
WHERE SUB.ProductCategoryID = CAT.ProductCategoryID
FOR XML PATH('')
), 1, 1, '' ) AS [Sub Categories]
FROM
Production.ProductCategory CAT
Results from query.
As you can see, the sub category values have been aggregated to a single row with distinct values in sort order.
Execution Plan used in statement above:
2) SQL CLR version
In using the solution from codeplex, we have a SQL CLR function that has been implemented to a
database in SQL Server.
SELECT CAT.Name
AS [Category],
dbo.GROUP_CONCAT_S(SUB.Name,1) as [Sub Categories]
--,
FROM Production.ProductCategory CAT
inner join Production.ProductSubcategory SUB
on SUB.ProductCategoryID = CAT.ProductCategoryID
GROUP BY CAT.Name
Results from query.
As you can see, the sub category values have been aggregated to a single row with distinct values in sort order, similar to the TSQL version.
Execution Plan used in statement above:
So far, both options hold promise to address the scenario at hand.
Scenario 2 - Dig deeper, performing a group concat with three tables
With all things being equal, I wanted to perform a slightly more complex query to see what the execution plan would be, but ended up discovering a variance in results.
1) TSQL version
SELECT CAT.Name AS [Category],
STUFF(( SELECT ',' + SUB.Name AS [text()]
FROM Production.ProductSubcategory SUB
WHERE
SUB.ProductCategoryID = CAT.ProductCategoryID
order by 1
FOR XML PATH('')
), 1, 1, '' ) AS [Sub Categories],
STUFF(( SELECT distinct ',' + RTRIM(product.Class) AS [text()]
FROM Production.ProductSubcategory SUB
inner join [Production].[Product] product
on product.ProductSubcategoryID = sub.ProductSubcategoryID
WHERE
SUB.ProductCategoryID = CAT.ProductCategoryID
order by 1
FOR XML PATH('')
), 1, 1, '' ) AS [Class_available]
FROM Production.ProductCategory CAT
Results from query.
As you can see, the sub category values have been aggregated to a single row with distinct values in sort order and thus, returning what was intended.
Execution Plan used in statement above:
As you add each new sub-query, the impact to the execution plan is a new branch to the execution tree as you can see above.
2) SQL CLR version
Here we add a second GROUP_CONCAT function to the column list in the SELECT statement, and do not add another column to the group by as we have two aggregate functions in the SELECT.
SELECT CAT.Name
AS [Category],
dbo.GROUP_CONCAT_S(SUB.Name,1) as [Sub Categories],
dbo.GROUP_CONCAT_DS(product.Class,',',1) AS [Class_available]
FROM Production.ProductCategory CAT
inner join Production.ProductSubcategory SUB
on SUB.ProductCategoryID = CAT.ProductCategoryID
inner join [Production].[Product] product
on product.ProductSubcategoryID
= sub.ProductSubcategoryID
GROUP BY CAT.Name
Results from query.
As you can see, the results are eschewed when adding another table / level to the query. So close, but no cigar. Further testing allowed me to get the proper classes available when adding the subcategory name to the group by clause, but in doing so, it lost the string aggregate values for Sub Categories.
Execution Plan used in statement above:
Conclusion:
With any canned solution, first test it to make sure it is performing for your scenario at hand. The possibility that how you are using it is different than what the solution was coded for. Though the TSQL solution is customized and not reusable, and has more of an impact on the query plan, it is providing proper results for the requested query.