案例分享:T-SQL 資料轉置(Transpose Data)

案例分享:T-SQL 資料轉置(Transpose Data)

這幾天論壇上有個問題:如何計算各科各等第的人數,是關於如何下 SQL 查詢將資料做轉置(有時稱「樞紐分析」,雖然 Excel 的樞紐分析功能強大許多…),對熟悉 T-SQL 的開發人員來說這需求其實滿常見的,若曾經開發過報表多半都碰上過幾次。

本案例的癥結點是原始資料表的設計不利於轉成發問者要輸出的格式,我們拿原始資料以及想要輸出的結果放在一起對照看看:

transpose data - 1.compare two results 

容我用自己的觀點來論述一下(有錯還請指正),由上圖做比較後兩者格式可說是幾乎一樣,差別只在「觀測角度」,如果要給報表一個名稱,大概會稱之為學生各科成績級分(以學生為主軸),另一則是級分各科人數統計(以級分為主軸),顯而易見的是「學生」跟「級分」兩個維度並無關係,所以無法用任何算式做轉換,估計必須將原始資料做拆解、重組之後才能輸出想要的結果。

因為發問者不只在一個技術論壇提問,剛好我常去的兩大論壇都有,因此也能順便見識一下眾家高手的不同見解。截至目前為止,包含我自己提出的解法共有三種,接下來會為各位逐一列示,但在這之前我們得先在 tempdb 建立樣本資料:
(
	STID INT NOT NULL, 
	Math INT NOT NULL, 
	Social INT NOT NULL, 
	Nature INT NOT NULL, 
	English INT NOT NULL, 
	Chinese INT NOT NULL, 
	Music INT NOT NULL, 
	Physic INT NOT NULL, 
	Work INT NOT NULL
);

INSERT INTO #t VALUES(9001, 5, 3, 3, 1, 0, 4, 1, 2);
INSERT INTO #t VALUES(9002, 0, 2, 3, 5, 1, 4, 5, 1);
INSERT INTO #t VALUES(9003, 3, 3, 4, 4, 4, 4, 2, 4);
INSERT INTO #t VALUES(9004, 2, 4, 4, 3, 5, 2, 5, 4);
INSERT INTO #t VALUES(9005, 4, 3, 1, 2, 4, 2, 5, 1);
INSERT INTO #t VALUES(9006, 4, 3, 5, 3, 5, 2, 3, 5);
INSERT INTO #t VALUES(9007, 0, 4, 1, 5, 0, 2, 5, 2);
INSERT INTO #t VALUES(9008, 3, 5, 3, 1, 5, 4, 3, 5);
INSERT INTO #t VALUES(9009, 3, 5, 5, 4, 4, 1, 3, 4);
INSERT INTO #t VALUES(9010, 3, 4, 3, 2, 4, 0, 3, 0);
INSERT INTO #t VALUES(9011, 5, 1, 1, 4, 3, 0, 0, 3);
INSERT INTO #t VALUES(9012, 2, 2, 1, 3, 1, 1, 5, 2);
INSERT INTO #t VALUES(9013, 4, 1, 4, 4, 3, 0, 0, 4);
INSERT INTO #t VALUES(9014, 3, 3, 5, 0, 5, 2, 1, 1);
INSERT INTO #t VALUES(9015, 2, 4, 0, 5, 3, 1, 0, 1);
GO

首先來看解法一:
CREATE TABLE #lv
(
	[Level] INT NOT NULL
);

INSERT INTO #lv VALUES(0);
INSERT INTO #lv VALUES(1);
INSERT INTO #lv VALUES(2);
INSERT INTO #lv VALUES(3);
INSERT INTO #lv VALUES(4);
INSERT INTO #lv VALUES(5);
GO

SELECT 
	#lv.[Level], M.MathC, S.SocialC, N.NatureC, E.EnglishC, C.ChineseC, MU.MusicC, P.PhysicC, W.WorkC
FROM #lv 
	LEFT OUTER JOIN (SELECT Math, COUNT(Math) AS MathC FROM #t GROUP BY Math) M ON #lv.[Level] = M.Math
	LEFT OUTER JOIN (SELECT Social, COUNT(Social) AS SocialC FROM #t GROUP BY Social) S ON #lv.[Level] = S.Social
	LEFT OUTER JOIN (SELECT Nature, COUNT(Nature) AS NatureC FROM #t GROUP BY Nature) N ON #lv.[Level] = N.Nature
	LEFT OUTER JOIN (SELECT English, COUNT(English) AS EnglishC FROM #t GROUP BY English) E ON #lv.[Level] = E.English
	LEFT OUTER JOIN (SELECT Chinese, COUNT(Chinese) AS ChineseC FROM #t GROUP BY Chinese) C ON #lv.[Level] = C.Chinese
	LEFT OUTER JOIN (SELECT Music, COUNT(Music) AS MusicC FROM #t GROUP BY Music) MU ON #lv.[Level] = MU.Music
	LEFT OUTER JOIN (SELECT Physic, COUNT(Physic) AS PhysicC FROM #t GROUP BY Physic) P ON #lv.[Level] = P.Physic
	LEFT OUTER JOIN (SELECT Work, COUNT(Work) AS WorkC FROM #t GROUP BY Work) W ON #lv.[Level] = W.Work;
GO

這個解法很直覺,分別對各科算出各級分的統計人數,再以外部聯結(OUTER JOIN)的方式併回級分表,而且是有幾科就需聯結幾次,由於外部聯結本來就需要多花一些成本,加上反覆操作資料的交互影響下,料想效能不會太好…。底下是查詢計劃,預估的查詢成本為 0.0341371

transpose table - 2.outer join cost

緊接著保留解法一的前半段,從 15 行以後的指令碼改為底下所示,就是解法二了:
	[Level],
	SUM(CASE WHEN Math = [Level] THEN 1 ELSE 0 END) AS Math, 
	SUM(CASE WHEN Social = [Level] THEN 1 ELSE 0 END) AS Social,
	SUM(CASE WHEN Nature = [Level] THEN 1 ELSE 0 END) AS Nature,
	SUM(CASE WHEN English = [Level] THEN 1 ELSE 0 END) AS English,
	SUM(CASE WHEN Chinese = [Level] THEN 1 ELSE 0 END) AS Chinese,
	SUM(CASE WHEN Music = [Level] THEN 1 ELSE 0 END) AS Music,
	SUM(CASE WHEN Physic = [Level] THEN 1 ELSE 0 END) AS Physic,
	SUM(CASE WHEN Work = [Level] THEN 1 ELSE 0 END) AS Work
FROM #t CROSS JOIN #lv 
GROUP BY [Level];
GO

這寫法的邏輯是把原始資料跟級分表先做乘積(CROSS JOIN)展開成一組對照表,然後查表取統計值。參考下圖可以看到 STID 是 9001 的這位學生,其 Math 這一科得到 5 級分,對應到 Level = 5 這一欄,所以數學這一科 5 級分的統計人數要加 1:

transpose table - 3.cross join concept 

最後的數據做加總就能得出結果,執行效率也不錯,查詢計劃顯示成本是 0.0188808(值得一提的是這語法可以在 SQL Server 2000 的環境中使用):

transpose table - 4.cross join cost

第三種解法是我靈機一動嘗試寫看看得出來的,指令如下:
WITH unpvt AS
(
	-- 先反轉置
	SELECT STID, [Subject], [Level]
	FROM (
		SELECT STID, Math, Social, Nature, English, Chinese, Music, Physic, Work
		FROM #t
	) AS t
	UNPIVOT (
		[Level] FOR [Subject] IN 
		(Math, Social, Nature, English, Chinese, Music, Physic, Work)
	) AS unpvt
), pvt AS (
	-- 轉置
	SELECT [Level], Math, Social, Nature, English, Chinese, Music, Physic, Work
	FROM (
		-- 彙總(計算人數)
		SELECT [Subject], [Level], COUNT(STID) AS Counter
		FROM unpvt
		GROUP BY [Subject], [Level]
	) AS src
	PIVOT
	(
		SUM(Counter) FOR [Subject] IN
		(Math, Social, Nature, English, Chinese, Music, Physic, Work)
	) AS pvt
)
SELECT *
FROM pvt;
GO

本語法不需要建立級分表,而我的邏輯也很簡單,本文開頭就說過兩個毫無關係的維度無法直接轉換,惟有將資料還原為原本該有的樣子才能進一步處理。由於 CTE、PIVOT、UNPIVOT 都是 SQL Server 2005 之後出現的新功能,因此舊版 SQL Server 無法使用是其缺點,但相對來說優點就是查詢效能令人感到意外地好(優於前兩者),查詢計劃顯示的預估成本是 0.0166329,請參考下圖:

transpose table - 5.unpivot and pivot cost

寫到這裡,我想把發問者的原始資料反轉置之後實體化為一個比較正規設計的資料表,看看假如今天設計正確的話,直接使用 PIVOT 轉置資料表的查詢成本為何,完整語法如下(需沿用前面建立的樣本資料):
CREATE TABLE #r
(
	STID INT NOT NULL,
	[Subject] NVARCHAR(10) NOT NULL,
	[Level] INT NOT NULL
);
GO

-- 批次插入反轉置後的原始資料
INSERT INTO #r
SELECT STID, [Subject], [Level]
FROM (
	SELECT STID, Math, Social, Nature, English, Chinese, Music, Physic, Work
	FROM #t
) AS t
UNPIVOT (
	[Level] FOR [Subject] IN 
	(Math, Social, Nature, English, Chinese, Music, Physic, Work)
) AS unpvt
GO

-- 轉置為「級分各科人數統計」表 
SELECT [Level], Math, Social, Nature, English, Chinese, Music, Physic, Work
FROM (
	-- 彙總(計算人數)
	SELECT [Subject], [Level], COUNT(STID) AS Counter
	FROM #r
	GROUP BY [Subject], [Level]
) AS src
PIVOT
(
	SUM(Counter) FOR [Subject] IN
	(Math, Social, Nature, English, Chinese, Music, Physic, Work)
) AS pvt;
GO

-- 轉置為「學生各科成績級分」表
SELECT STID, Math, Social, Nature, English, Chinese, Music, Physic, Work
FROM (
	SELECT STID, [Subject], AVG([Level]) AS [Level]
	FROM #r
	GROUP BY STID, [Subject]
) AS src
PIVOT
(
	AVG([Level]) FOR [Subject] IN
	(Math, Social, Nature, English, Chinese, Music, Physic, Work)
) AS pvt;
GO

再次令人感到意外!假設一開始資料表設計是比較沒問題的,可以直接轉置輸出原發問者所要的報表,其查詢成本是 0.0162121

transpose table - 6.direct pivot cost

比起有問題的資料表先反轉置再轉置處理雖然有快一些,但差距其實不大,也許資料量一多節省的成本才會較為可觀吧…,但我仍然要強調資料表如何設計還是很重要就是了,至少以本例來說,隨著不同的觀測角度你不需再花費額外的處理,可以直接利用 PIVOT 彙轉總表(SQL Server 2000 可以參考前述解法二依樣畫葫蘆);若再考慮其他的應用,比方說求每位學生的平均級分,想想原發問者的資料存放結構其實仍是很不利的。

最後做個總結,本案例所帶出的幾個重點:
  • 不宜使用過多聯結(JOIN),尤其應少用外部聯結(OUTER JOIN),因為查找未符合的資料列需額外成本。
  • SQL 語法是基於集合的邏輯,天生就適合批次查詢資料、彙總計算,反觀若需逐筆操作通常效率極差。
  • 經驗很重要,但勇於嘗試、另闢蹊徑有時是獲得額外經驗的不二法門。

以上,這次的寶貴經驗在這篇整理出來,希望可以對大家有一點幫助,若寫得不好或有不對的地方再請各位多多包涵,不吝給予指正喔!