本笔记为阿里云天池龙珠计划SQL训练营的学习内容,链接为:https://tianchi.aliyun.com/specials/promotion/aicampsql
下载相关创建表格和插入数据脚本
这里为了方便大家,数据脚本已经给大家准备好啦。
点击下方链接直接下载
大家下载好脚本后,先在MySQL环境中运行 create_table.sql 脚本,创建数据表,然后解压下载好的 data.zip,解压后目录如下:
章节 | 脚本 |
---|---|
8,10 | ccf_offline_stage1_train.sql |
6 | winequality-white.sql |
5,8,10 | ccf_online_stage1_train.sql |
4 | macro industry.sql |
3 | ccf_offline_stage1_test_revised.sql |
2 | winequality-red.sql |
1,9 | income statement.sql |
1,7 | market data.sql |
脚本文件名前面的序号表示用到该数据集的题目序号,例如1,7 market data.sql
表示第 1 题和第 7 题用到了该数据集。
同样的,这里给大家的也是 SQL 脚本,里面是插入数据的语句,大家只需打开后在 MySQL 环境中运行即可将数据导入到数据表中。
往数据库中插入几百兆以上的 SQL 数据,既要考虑效率,也要避免因为一次性操作过多而导致数据库资源耗尽或超时问题。
以下是几种常见的解决方案和最佳实践:
- 批量插入(Batch Insert)
将数据分成小批量(如每批 1000 行或 10000 行)逐步插入,避免一次性插入过多数据。
INSERT INTO table_name (col1, col2, col3)
VALUES
(value1, value2, value3),
(value4, value5, value6),
...
;
优化建议:
- 使用事务:将每批插入操作包裹在事务中,提高效率。
BEGIN TRANSACTION;
INSERT INTO table_name ...;
INSERT INTO table_name ...;
COMMIT;
- 调整批量大小:根据数据库性能和网络带宽选择合适的批次大小。
- 通过工具加载(Load Data Tool)
大部分数据库提供高效加载大数据文件的工具,例如:
- MySQL: 使用 LOAD DATA INFILE 直接导入文件。
LOAD DATA INFILE 'file_path'
INTO TABLE table_name
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(col1, col2, col3);
- PostgreSQL: 使用 COPY 命令。
COPY table_name (col1, col2, col3)
FROM 'file_path'
WITH (FORMAT CSV, HEADER TRUE);
- SQL Server: 使用 bcp 或 BULK INSERT。
BULK INSERT table_name
FROM 'file_path'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');
优点:
- 工具级别优化。
- 数据库特定实现,性能更高。
- 使用程序语言驱动(Programmatic Insert)
通过编程语言(如 Python、Java、Go)将大文件分块读入内存并逐步插入数据库。
- Python 示例(适用 MySQL):
import pymysql
conn = pymysql.connect(host='localhost', user='user', password='password', database='dbname')
cursor = conn.cursor()
data = [(value1, value2, value3), (value4, value5, value6), ...]
batch_size = 1000
for i in range(0, len(data), batch_size):
batch = data[i:i + batch_size]
cursor.executemany("INSERT INTO table_name (col1, col2, col3) VALUES (%s, %s, %s)", batch)
conn.commit()
conn.close()
优化建议:
- 使用 executemany 提高插入效率。
- 异步插入或多线程插入。
- 分片插入(Partitioned Insert)
将数据文件按照某个字段分片,例如按日期、ID范围拆分成多个小文件。并行化插入到不同的分区或表中,再合并结果。
- 关闭约束和索引
- 在大量数据插入时,禁用表的外键约束和索引,可以显著提升性能。
ALTER TABLE table_name DISABLE KEYS;
-- 插入数据
ALTER TABLE table_name ENABLE KEYS;
- 适用场景:索引更新和约束检查对性能影响显著。
- 注意:插入完成后务必重新启用。
- 使用临时表
- 首先将数据插入临时表(无索引或约束),然后批量迁移到目标表。
INSERT INTO final_table (col1, col2, col3)
SELECT col1, col2, col3
FROM temp_table;
- 数据压缩与解压
-
对数据文件进行压缩(如 Gzip、Bzip2),通过工具或程序直接解压并导入。
-
PostgreSQL 示例:
gzip -c data.csv | psql -U user -d database -c "COPY table_name FROM STDIN WITH CSV"
- 分布式数据库(如 TiDB、CockroachDB)
对于海量数据,可以使用支持分布式写入的数据库,通过分布式机制并发导入。
- 流式插入(Streaming Inserts)
将数据按流(stream)读取并插入,适用于实时处理和插入的场景。
- 典型工具:Kafka + Sink Connector。
- 数据导入工具
使用专业的数据迁移工具:
- MySQL: MySQL Shell 的 importTable。
- PostgreSQL: pgloader、pg_bulkload。
- 通用: Talend、Airbyte、DBeaver 等。
性能调优建议
- 调整数据库配置:如 MySQL 的 innodb_buffer_pool_size、bulk_insert_buffer_size。
- 使用 SSD 或高性能磁盘。
- 避免网络瓶颈:在同一台服务器上执行。
如需根据具体数据库和环境选择最佳方案,可以详细提供上下文信息进一步优化!
请使用A股上市公司季度营收预测数据集《Income Statement.xls》和《Company Operating.xlsx》和《Market Data.xlsx》,以Market Data为主表,将三张表中的TICKER_SYMBOL为600383和600048的信息合并在一起。只需要显示以下字段。
表名 | 字段名 | 中文解释 |
---|---|---|
Income Statement | TICKER_SYMBOL | 股票代码 |
Income Statement | END_DATE | 截止日期 |
Income Statement | T_REVENUE | 总收入 |
Income Statement | T_COGS | 销售成本(营业成本) |
Income Statement | N_INCOME | 净利润 |
Market Data | TICKER_SYMBOL | 股票代码 |
Market Data | END_DATE_ | 截止日期(可能有特殊标记) |
Market Data | CLOSE_PRICE | 收盘价格 |
Company Operating | TICKER_SYMBOL | 股票代码 |
Company Operating | INDIC_NAME_EN | 指标英文名称 |
Company Operating | END_DATE | 截止日期 |
Company Operating | VALUE | 指标值 |
数据准备的步骤:
- fixed acidity(固定酸度):葡萄酒中无法蒸发的酸,如酒石酸。
- volatile acidity(挥发性酸度):易挥发酸的含量,如乙酸;高含量会导致醋味。
- citric acid(柠檬酸):微量增加新鲜感。
- residual sugar(残余糖分):发酵后剩余的糖分,影响甜度。
- chlorides(氯化物):葡萄酒中的盐分含量。
- free sulfur dioxide(游离二氧化硫):对微生物的抗性和抗氧化性。
- total sulfur dioxide(总二氧化硫):包括结合态和游离态二氧化硫。
- density(密度):与酒精含量和糖分有关。
- pH:酸碱度,影响味道平衡。
- sulphates(硫酸盐):增加葡萄酒的防腐性。
- alcohol(酒精含量):对感官影响显著。
请使用 Wine Quality Data 数据集《winequality-red.csv》,找出 pH=3.03的所有红葡萄酒,然后,对其 citric acid 进行中式排名(相同排名的下一个名次应该是下一个连续的整数值。换句话说,名次之间不应该有“间隔”)
Field | Description |
---|---|
User_id | 用户 ID |
Merchant_id | 商户 ID |
Coupon_id | 优惠券 ID:null 表示无优惠券消费,此时 Discount_rate 和 Date_received 字段无意义 |
Discount_rate | 优惠率:$x \in [0,1] $ 代表折扣率;x:y 表示满 x 减 y 。单位是元 |
Distance | user 经常活动的地点离该 merchant 的最近门店距离是 x*500 米(如果是连锁店,则取最近的一家门店),$x \in [0,10]$;null 表示无此信息,0 表示低于 500 米,10 表示大于 5 公里; |
Date_received | 领取优惠券日期 |
Date | 消费日期:如果 Date=null & Coupon_id != null,该记录表示领取优惠券但没有使用,即负样本;如果 Date!=null & Coupon_id = null,则表示普通消费日期;如果Date!=null & Coupon_id != null,则表示用优惠券消费日期,即正样本; |
Field | Description |
---|---|
User_id | 用户 ID |
Merchant_id | 商户 ID |
Action | 0 点击, 1 购买,2 领取优惠券 |
Coupon_id | 优惠券 ID:null 表示无优惠券消费,此时 Discount_rate 和 Date_received 字段无意义。“fixed” 表示该交易是限时低价活动。 |
Discount_rate | 优惠率:$x \in [0,1]$ 代表折扣率;x:y 表示满 x 减 y;“fixed” 表示低价限时优惠; |
Date_received | 领取优惠券日期 |
Date | 消费日期:如果 Date=null & Coupon_id != null,该记录表示领取优惠券但没有使用;如果 Date!=null & Coupon_id = null,则表示普通消费日期;如果 Date!=null & Coupon_id != null,则表示用优惠券消费日期; |
Field | Description |
---|---|
User_id | 用户 ID |
Merchant_id | 商户 ID |
Coupon_id | 优惠券 ID |
Discount_rate | 优惠率:$x \in [0,1]$ 代表折扣率;x:y 表示满 x 减 y. |
Distance | user 经常活动的地点离该 merchant 的最近门店距离是 x*500 米(如果是连锁店,则取最近的一家门店),$ x \in [0,10]$;null 表示无此信息,0 表示低于 500 米,10 表示大于 5 公里; |
Date_received | 领取优惠券日期 |
其中user_id,coupon_id和date_received均来自Table 3,而Probability为预测值
Field | Description |
---|---|
User_id | 用户 ID |
Coupon_id | 优惠券 ID |
Date_received | 领取优惠券日期 |
Probability | 15 天内用券概率,由参赛选手给出 |
使用Coupon Usage Data for O2O中的数据集《ccf_offline_stage1_test_revised.csv》,试分别找出在2016年7月期间,发放优惠券总金额最多和发放优惠券张数最多的商家。
这里只考虑满减的金额,不考虑打几折的优惠券。
请使用A股上市公司季度营收预测中的数据集《Macro&Industry.xlsx》中的sheet-INDIC_DATA,请计算:
全社会用电量:第一产业:当月值在2015年用电最高峰是发生在哪月?并且相比去年同期增长/减少了多少个百分比?
使用Coupon Usage Data for O2O中的数据集《ccf_online_stage1_train.csv》,试统计:
在2016年6月期间,线上总体优惠券弃用率为多少?并找出优惠券弃用率最高的商家。
请使用 Wine Quality Data 数据集《winequality-white.csv》,找出 pH=3.63的所有白葡萄酒,然后,对其 residual sugar 量进行英式排名(非连续的排名)
请使用 A 股上市公司季度营收预测中的数据集《Market Data.xlsx》中的 sheet-DATA,
- 计算截止到 2018 年底,市值最大的三个行业是哪些?
- 及这三个行业里市值最大的三个公司是哪些?(每个行业找出前三大的公司,即一共要找出9个)
使用Coupon Usage Data for O2O中的数据集《ccf_online_stage1_train.csv》和《ccf_offline_stage1_train.csv》,试找出在2016年6月期间,线上线下累计优惠券使用次数最多的顾客。
请使用A股上市公司季度营收预测数据集《Income Statement.xls》中的sheet-General Business和《Company Operating.xlsx》中的sheet-EN。
找出在数据集所有年份中,按季度统计,白云机场旅客吞吐量最高的那一季度对应的净利润是多少?(注意,是单季度对应的净利润,非累计净利润。)
使用Coupon Usage Data for O2O中的数据集《ccf_online_stage1_train.csv》和《ccf_offline_stage1_train.csv》,试找出在2016年6月期间,线上线下累计被使用优惠券满减最多的前3名商家。
比如商家A,消费者A在其中使用了一张200减50的,消费者B使用了一张30减1的,那么商家A累计被使用优惠券满减51元。
SELECT
i.TICKER_SYMBOL AS '股票代码',
i.END_DATE AS '截止日期',
i.T_REVENUE AS '总收入',
i.T_COGS AS '销售成本',
i.N_INCOME AS '净利润',
m.TICKER_SYMBOL AS '股票代码',
m.END_DATE AS '截止日期',
m.CLOSE_PRICE AS '收盘价格',
c.TICKER_SYMBOL AS '股票代码',
c.INDIC_NAME_EN AS '指标英文名称',
c.END_DATE AS '截止日期',
c.VALUE AS '指标值'
FROM(
SELECT
TICKER_SYMBOL,
END_DATE,
CLOSE_PRICE
FROM `market data`
WHERE TICKER_SYMBOL IN ('600383', '600048')
) m
LEFT JOIN `income statement` i ON m.TICKER_SYMBOL = i.TICKER_SYMBOL
LEFT JOIN `company operating` c ON m.TICKER_SYMBOL = c.TICKER_SYMBOL;
注:大家可以看到查询的速度比较慢,主要原因是关联的 TICKER_SYMBOL 字段没有加索引。
- 对关联字段 TICKER_SYMBOL 增加索引 SQL 脚本
ALTER TABLEmarket data
ADD INDEX INDEX_M_TICKER_SYMBOL (TICKER_SYMBOL);
ALTER TABLEincome statement
ADD INDEX INDEX_I_TICKER_SYMBOL (TICKER_SYMBOL);
ALTER TABLEcompany operating
ADD INDEX INDEX_C_TICKER_SYMBOL (TICKER_SYMBOL);
SELECT *, DENSE_RANK() OVER (ORDER BY `citric acid`) AS dense_ranking
FROM `winequality-red`
WHERE pH=3.03;
Field | Description |
---|---|
User_id | 用户ID |
Merchant_id | 商户ID |
Coupon_id | 优惠券ID |
Discount_rate | 优惠率: |
Distance | user经常活动的地点离该merchant的最近门店距离是 |
Date_received | 领取优惠券日期 |
- 发放优惠券总金额最高的商家
SELECT
m.*
FROM(
SELECT
Merchant_id AS '商户 ID',
SUM(SUBSTRING_INDEX(Discount_rate, ':', 1)) AS '发放优惠券总金额'
FROM ccf_offline_stage1_test_revised
WHERE
DATE_FORMAT(Date_received, '%Y-%m') LIKE '2016-07%' -- 7月份
AND
Discount_rate LIKE '%:%' -- 优惠券
GROUP BY Merchant_id
) m
ORDER BY 发放优惠券总金额 DESC
LIMIT 1;
- 发放优惠券张数最多的商家
SELECT m.*
FROM(
SELECT
Merchant_id,
COUNT(Coupon_id) AS Coupon_Num
FROM ccf_offline_stage1_test_revised
WHERE
DATE_FORMAT(Date_received, '%Y-%m') LIKE '2016-07%'
AND
Discount_rate LIKE '%:%'
GROUP BY Merchant_id
) m
ORDER BY Coupon_Num DESC
LIMIT 1;
- 2015年用电最高峰的月份
SELECT
m.peak_month,
SUM(m.DATA_VALUE) as DATA_VALUE
FROM (
SELECT
DATA_VALUE,
DATE_FORMAT(PERIOD_DATE, '%Y-%m') AS peak_month
FROM `macro industry`
WHERE
DATE_FORMAT(PERIOD_DATE, '%Y') LIKE '2015%'
) m
GROUP BY m.peak_month
LIMIT 1
;
- 2014年12月的用电量
SELECT
m.peak_month,
SUM(m.DATA_VALUE) as DATA_VALUE
FROM(
SELECT
DATA_VALUE, DATE_FORMAT(PERIOD_DATE, '%Y-%m') AS peak_month
FROM `macro industry`
WHERE
DATE_FORMAT(PERIOD_DATE, '%Y-%m') LIKE '2014-12%'
) m
GROUP BY m.peak_month;
SELECT
CONCAT(
FORMAT(((4268310.33806 - 3594152.12446) / 4268310.33806) * 100, 2),
'%') AS increase_rate;
SELECT
CONCAT(
FORMAT(
(
(
----2015年用电量最大
(
SELECT
m1.DATA_VALUE
FROM (
SELECT
m.peak_month,
SUM(m.DATA_VALUE) as DATA_VALUE
FROM (
SELECT
DATA_VALUE,
DATE_FORMAT(PERIOD_DATE, '%Y-%m') AS peak_month
FROM `macro industry`
WHERE DATE_FORMAT(PERIOD_DATE, '%Y') LIKE '2015%'
) m
GROUP BY m.peak_month
) m1
ORDER BY m1.DATA_VALUE DESC
LIMIT 1
)
-
-----2014 年 12 月份总量
(
SELECT
SUM(m.DATA_VALUE) as DATA_VALUE
FROM (
SELECT
DATA_VALUE,
DATE_FORMAT(PERIOD_DATE, '%Y-%m') AS peak_month
FROM `macro industry`
WHERE DATE_FORMAT(PERIOD_DATE, '%Y-%m') LIKE '2014-12%'
) m
GROUP BY m.peak_month
)
)
/
(
SELECT
m1.DATA_VALUE
FROM (
SELECT
m.peak_month,
SUM(m.DATA_VALUE) as DATA_VALUE
FROM (
SELECT
DATA_VALUE,
DATE_FORMAT(PERIOD_DATE, '%Y-%m') AS peak_month
FROM `macro industry`
WHERE DATE_FORMAT(PERIOD_DATE, '%Y') LIKE '2015%'
) m
GROUP BY m.peak_month) m1
ORDER BY m1.DATA_VALUE DESC
LIMIT 1
)
)
*
100
,
2
),
'%'
) AS increase_rate;
上面的写法,可以给 60 分,算是一个及格分。
SQL 语句确实较为复杂,主要问题在于嵌套查询过多,难以维护和理解。可以通过以下优化方法简化结构,同时提升可读性和执行效率:
优化建议
- 分步查询:分解逻辑
使用 Common Table Expressions (CTEs) 或 子查询表 分步骤计算。这样可以将每一步的逻辑拆分开,使得代码更易读。
- 减少重复计算
查询中有多次重复计算,例如 2015 年度数据的最大值查询,应该将这部分逻辑单独提取出来,避免多次重复执行。
- 避免不必要的分组和排序
如果某些分组和排序操作只用于单条结果,可以直接用聚合函数(如 MAX() 和 SUM())替代复杂的 ORDER BY 和 LIMIT。
- 命名字段
为字段和中间计算命名,提升代码的可读性。
优化后的代码
WITH
-- 2015 年每月总和及最大值
monthly_data_2015 AS (
SELECT
DATE_FORMAT(PERIOD_DATE, '%Y-%m') AS peak_month,
SUM(DATA_VALUE) AS total_value
FROM `macro industry`
WHERE DATE_FORMAT(PERIOD_DATE, '%Y') = '2015'
GROUP BY DATE_FORMAT(PERIOD_DATE, '%Y-%m')
),
-- 2015 年最大月数据
max_month_2015 AS (
SELECT
MAX(total_value) AS max_value
FROM monthly_data_2015
),
-- 2014 年 12 月的总和数据
december_2014_data AS (
SELECT
SUM(DATA_VALUE) AS total_value
FROM `macro industry`
WHERE DATE_FORMAT(PERIOD_DATE, '%Y-%m') = '2014-12'
)
-- 计算增长率
SELECT
CONCAT(
FORMAT(
(
(max_month_2015.max_value - december_2014_data.total_value)
/
max_month_2015.max_value
) * 100,
2
),
'%'
) AS increase_rate
FROM max_month_2015, december_2014_data;
上面的写法好了一些,但是还是得不到高分,70 分吧。
上面的写法中 2014-12 还是写死的,还不是完美的解决方案:
如果要变成动态的可以使用下面的方法:
WITH
-- 2015 年每月总和及最大值
monthly_data_2015 AS (
SELECT
DATE_FORMAT(PERIOD_DATE, '%Y-%m') AS peak_month,
MONTH(PERIOD_DATE) AS month_number,
SUM(DATA_VALUE) AS total_value
FROM `macro industry`
WHERE DATE_FORMAT(PERIOD_DATE, '%Y') = '2015'
GROUP BY DATE_FORMAT(PERIOD_DATE, '%Y-%m'), MONTH(PERIOD_DATE)
),
-- 获取 2015 年数据最大值的月份编号
max_month_2015 AS (
SELECT
peak_month,
total_value AS max_value,
MONTH(STR_TO_DATE(peak_month, '%Y-%m')) AS max_month_number
FROM monthly_data_2015
ORDER BY total_value DESC
LIMIT 1
),
-- 动态生成目标月份
target_month AS (
SELECT
max_month_number AS target_month_number
FROM max_month_2015
),
-- 获取动态计算的目标月份对应的 2014 年数据
dynamic_month_data AS (
SELECT
SUM(DATA_VALUE) AS total_value
FROM `macro industry`,
target_month -- 引入 target_month
WHERE DATE_FORMAT(PERIOD_DATE, '%Y-%m') = CONCAT('2014-', LPAD(target_month.target_month_number, 2, '0'))
)
-- 计算增长率
SELECT
CONCAT(
FORMAT(
(
(max_month_2015.max_value - dynamic_month_data.total_value)
/
max_month_2015.max_value
) * 100,
2
),
'%'
) AS increase_rate
FROM max_month_2015, dynamic_month_data;
- 被领券但未使用的优惠券张数
SELECT
COUNT(Coupon_id) AS Coupon_num
FROM ccf_online_stage1_train
WHERE
Coupon_id != 'fixed'
AND
Date IS NULL
AND
Coupon_id IS NOT NULL
AND
DATE_FORMAT(Date_received, '%Y-%m') LIKE '2016-06%';
- 总的被领用的优惠券张数
SELECT (
SELECT COUNT(Coupon_id) AS Coupon_num
FROM ccf_online_stage1_train
WHERE
Coupon_id != 'fixed'
AND
Date IS NULL
AND
Coupon_id IS NOT NULL
AND
DATE_FORMAT(Date_received, '%Y-%m') LIKE '2016-06%'
)
+
(
SELECT COUNT(Coupon_id) AS Coupon_num
FROM ccf_online_stage1_train
WHERE
Coupon_id != 'fixed'
AND
Date IS NOT NULL
AND
Date_received IS NOT NULL
AND DATE_FORMAT(Date_received, '%Y-%m') LIKE '2016-06%'
) AS sum_coupon_num;
- 弃用率
SELECT CONCAT(FORMAT((79053/90406) * 100, 2), '%') give_up_rate;
SELECT CONCAT(FORMAT(((SELECT COUNT(Coupon_id) AS coupon_num
FROM ccf_online_stage1_train
WHERE Coupon_id != 'fixed' AND Date IS NULL AND Coupon_id IS NOT NULL AND DATE_FORMAT(Date_received, '%Y-%m') LIKE '2016-06%')/(SELECT (SELECT COUNT(Coupon_id) AS Coupon_num
FROM ccf_online_stage1_train
WHERE Coupon_id != 'fixed' AND Date IS NULL AND Coupon_id IS NOT NULL AND DATE_FORMAT(Date_received, '%Y-%m') LIKE '2016-06%') + (SELECT COUNT(Coupon_id) AS Coupon_num
FROM ccf_online_stage1_train
WHERE Coupon_id != 'fixed' AND Date IS NOT NULL AND Date_received IS NOT NULL AND DATE_FORMAT(Date_received, '%Y-%m') LIKE '2016-06%') AS sum_coupon_num)) * 100, 2), '%') give_up_rate;
换用一种更好的方法:
WITH
-- 统计 2016-06 中未使用的优惠券数量
unused_coupons AS (
SELECT
COUNT(Coupon_id) AS coupon_num
FROM ccf_online_stage1_train
WHERE
Coupon_id != 'fixed'
AND Date IS NULL
AND Coupon_id IS NOT NULL
AND DATE_FORMAT(Date_received, '%Y-%m') = '2016-06'
),
-- 统计 2016-06 中使用的优惠券数量
used_coupons AS (
SELECT
COUNT(Coupon_id) AS coupon_num
FROM ccf_online_stage1_train
WHERE
Coupon_id != 'fixed'
AND Date IS NOT NULL
AND Date_received IS NOT NULL
AND DATE_FORMAT(Date_received, '%Y-%m') = '2016-06'
),
-- 计算 2016-06 的优惠券总数量
total_coupons AS (
SELECT
unused_coupons.coupon_num + used_coupons.coupon_num AS sum_coupon_num
FROM unused_coupons, used_coupons
)
-- 计算放弃率
SELECT
CONCAT(
FORMAT(
(unused_coupons.coupon_num / total_coupons.sum_coupon_num) * 100,
2
),
'%'
) AS give_up_rate
FROM unused_coupons, total_coupons;
- 优惠券弃用率最高的商家
SELECT m4.*
FROM
(SELECT m3.Merchant_id, m3.unused_coupon_num, m3.used_coupon_num, m3.sum_coupon_num, FORMAT((m3.unused_coupon_num/m3.sum_coupon_num) * 100, 2) AS give_up_rate
FROM
(SELECT m1.Merchant_id, m1.unused_coupon_num, m2.used_coupon_num, (m1.unused_coupon_num + m2.used_coupon_num) AS sum_coupon_num
FROM
(SELECT Merchant_id, COUNT(Coupon_id) AS unused_coupon_num
FROM ccf_online_stage1_train
WHERE Coupon_id != 'fixed' AND Date IS NULL AND Coupon_id IS NOT NULL AND DATE_FORMAT(Date_received, '%Y-%m') LIKE '2016-06%'
GROUP BY Merchant_id) m1
INNER JOIN
(SELECT Merchant_id, COUNT(Coupon_id) AS used_coupon_num
FROM ccf_online_stage1_train
WHERE Coupon_id != 'fixed' AND Date IS NOT NULL AND Date_received IS NOT NULL AND DATE_FORMAT(Date_received, '%Y-%m') LIKE '2016-06%'
GROUP BY Merchant_id) m2
ON m1.Merchant_id = m2.Merchant_id) m3
) m4
ORDER BY m4.give_up_rate DESC
LIMIT 1;
另外一种写法:
WITH
-- 统计每个商家的未使用优惠券数量
unused_coupons AS (
SELECT
Merchant_id,
COUNT(Coupon_id) AS unused_coupon_num
FROM ccf_online_stage1_train
WHERE
Coupon_id != 'fixed'
AND Date IS NULL
AND Coupon_id IS NOT NULL
AND DATE_FORMAT(Date_received, '%Y-%m') = '2016-06'
GROUP BY Merchant_id
),
-- 统计每个商家的已使用优惠券数量
used_coupons AS (
SELECT
Merchant_id,
COUNT(Coupon_id) AS used_coupon_num
FROM ccf_online_stage1_train
WHERE
Coupon_id != 'fixed'
AND Date IS NOT NULL
AND Date_received IS NOT NULL
AND DATE_FORMAT(Date_received, '%Y-%m') = '2016-06'
GROUP BY Merchant_id
),
-- 计算每个商家的总优惠券数量及放弃率
coupon_stats AS (
SELECT
m1.Merchant_id,
m1.unused_coupon_num,
m2.used_coupon_num,
(m1.unused_coupon_num + m2.used_coupon_num) AS sum_coupon_num,
FORMAT((m1.unused_coupon_num / (m1.unused_coupon_num + m2.used_coupon_num)) * 100, 2) AS give_up_rate
FROM unused_coupons m1
INNER JOIN used_coupons m2
ON m1.Merchant_id = m2.Merchant_id
)
-- 按放弃率降序排序,并取最高的商家
SELECT
*
FROM
coupon_stats
ORDER BY
give_up_rate DESC
LIMIT 1;
SELECT *, RANK() OVER (ORDER BY `residual sugar`) AS ranking
FROM `winequality-white`
WHERE pH=3.63;
-- 获取市值最高的 3 个行业
SELECT `market data`.TYPE_ID,`market data`.TYPE_NAME_CN,
SUM(CAST(MARKET_VALUE AS SIGNED)) AS MV, END_DATE
FROM `market data`
WHERE DATE_FORMAT(END_DATE, '%Y-%m-%d') = '2018-05-31'
GROUP BY TYPE_NAME_CN
ORDER BY MV desc
LIMIT 3;
-- UNION 在一起
(
SELECT
m.SECURITY_ID,
m.TYPE_ID,
m.TYPE_NAME_CN,
m.MARKET_VALUE
FROM (
SELECT
SECURITY_ID,
TYPE_ID,
TYPE_NAME_CN,
CAST(MARKET_VALUE AS SIGNED) AS MARKET_VALUE, END_DATE
FROM `market data`
WHERE
TYPE_ID='010303210101'
AND
DATE_FORMAT(END_DATE, '%Y-%m-%d') = '2018-05-31'
) m
ORDER BY m.MARKET_VALUE DESC LIMIT 3
)
UNION ALL
(
SELECT
m.SECURITY_ID,
m.TYPE_ID,
m.TYPE_NAME_CN,
m.MARKET_VALUE
FROM (
SELECT
SECURITY_ID,
TYPE_ID,
TYPE_NAME_CN,
CAST(MARKET_VALUE AS SIGNED) AS MARKET_VALUE,
END_DATE
FROM `market data`
WHERE
TYPE_ID='010303200101'
AND
DATE_FORMAT(END_DATE, '%Y-%m-%d') = '2018-05-31'
) m
ORDER BY m.MARKET_VALUE DESC LIMIT 3
)
UNION ALL
(
SELECT
m.SECURITY_ID,
m.TYPE_ID,
m.TYPE_NAME_CN,
m.MARKET_VALUE
FROM (
SELECT
SECURITY_ID,
TYPE_ID,
TYPE_NAME_CN,
CAST(MARKET_VALUE AS SIGNED) AS MARKET_VALUE,
END_DATE
FROM `market data`
WHERE
TYPE_ID='010303170301'
AND
DATE_FORMAT(END_DATE, '%Y-%m-%d') = '2018-05-31'
) m
ORDER BY m.MARKET_VALUE DESC LIMIT 3);
上面的写法有一个问题就是分开了查询,在实际的操作中,是不现实的。正确的做法可以使用子查询和窗口函数解决。
WITH
-- 获取 MARKET_VALUE 总值最大的前 3 个 TYPE_ID
top_types AS (
SELECT
TYPE_ID,
TYPE_NAME_CN,
SUM(CAST(MARKET_VALUE AS SIGNED)) AS MV
FROM `market data`
WHERE DATE_FORMAT(END_DATE, '%Y-%m-%d') = '2018-05-31'
GROUP BY TYPE_ID, TYPE_NAME_CN
ORDER BY MV DESC
LIMIT 3
),
-- 获取符合条件的所有记录
filtered_data AS (
SELECT
md.SECURITY_ID,
md.TYPE_ID,
tt.TYPE_NAME_CN,
CAST(md.MARKET_VALUE AS SIGNED) AS MARKET_VALUE,
md.END_DATE
FROM `market data` md
INNER JOIN top_types tt
ON md.TYPE_ID = tt.TYPE_ID
WHERE DATE_FORMAT(md.END_DATE, '%Y-%m-%d') = '2018-05-31'
),
-- 对每种 TYPE_ID 按 MARKET_VALUE 排序并取前 3 条
ranked_data AS (
SELECT
fd.SECURITY_ID,
fd.TYPE_ID,
fd.TYPE_NAME_CN,
fd.MARKET_VALUE,
ROW_NUMBER() OVER (
PARTITION BY fd.TYPE_ID
ORDER BY fd.MARKET_VALUE DESC
) AS row_num
FROM filtered_data fd
)
-- 提取每种 TYPE_ID 的前 3 条记录
SELECT
SECURITY_ID,
TYPE_ID,
TYPE_NAME_CN,
MARKET_VALUE
FROM ranked_data
WHERE row_num <= 3
ORDER BY TYPE_ID, MARKET_VALUE DESC;
SELECT m.User_id, m.Coupon_num
FROM (
SELECT c.User_id, COUNT(c.Coupon_id) as Coupon_num
FROM (
(
SELECT User_id, Coupon_id
FROM ccf_offline_stage1_train
WHERE
Date IS NOT NULL
AND
Date_received IS NOT NULL
AND
DATE_FORMAT(Date_received, '%Y-%m') LIKE '2016-06%'
)
UNION ALL
(
SELECT User_id, Coupon_id
FROM ccf_online_stage1_train
WHERE
Coupon_id != 'fixed'
AND
Date IS NOT NULL
AND
Date_received IS NOT NULL
AND
DATE_FORMAT(Date_received, '%Y-%m') LIKE '2016-06%'
)
) c
GROUP BY c.User_id
) m
ORDER BY m.Coupon_num DESC
LIMIT 1;
SELECT * FROM (
SELECT
TICKER_SYMBOL,
YEAR(END_DATE) Year,
QUARTER(END_DATE) QUARTER,
SUM(VALUE) Amount
FROM `company operating`
WHERE INDIC_NAME_EN = 'Baiyun Airport:Passenger throughput'
GROUP BY TICKER_SYMBOL,YEAR(END_DATE),QUARTER(END_DATE)
ORDER BY SUM(VALUE) DESC
LIMIT 1
) BaseData
LEFT JOIN –--- income statement
(
SELECT
TICKER_SYMBOL,
YEAR(END_DATE) Year,
QUARTER(END_DATE) QUARTER,
SUM(N_INCOME) Amount
FROM `income statement`
GROUP BY TICKER_SYMBOL,YEAR(END_DATE),QUARTER(END_DATE)
) Income
ON BaseData.TICKER_SYMBOL = Income.TICKER_SYMBOL
AND BaseData.Year = Income.Year
AND BaseData.QUARTER = Income.QUARTER;
SELECT s.Merchant_id, s.Coupon_sum
FROM (
SELECT m.Merchant_id, SUM(m.sum_) AS Coupon_sum
FROM (
SELECT
Merchant_id,
SUM(SUBSTRING_INDEX(Discount_rate, ':', -1)) OVER (PARTITION BY Merchant_id) AS sum_
FROM ccf_online_stage1_train
WHERE
Coupon_id != 'fixed'
AND
Date IS NOT NULL
AND
Date_received IS NOT NULL
AND
DATE_FORMAT(Date, '%Y-%m') LIKE '2016-06%'
UNION ALL
SELECT
Merchant_id,
SUM(SUBSTRING_INDEX(Discount_rate, ':', -1)) OVER (PARTITION BY Merchant_id) AS sum_
FROM ccf_offline_stage1_train
WHERE
Date IS NOT NULL
AND
Date_received IS NOT NULL
AND
DATE_FORMAT(Date, '%Y-%m') LIKE '2016-06%'
) m
GROUP BY m.Merchant_id
) s
ORDER BY s.Coupon_sum DESC
LIMIT 3;