The Data Warehouse Problem
A post here offers to solve the following problem:
This one came in as a data warehouse problem in 1999. You have a history table of customer daily total purchases. The problem is to report just those customers who decreased their purchase amounts on their most recent order placed with us. We are trying to get an idea when people are saturated with whatever we are selling. If their order level is holding steady we are happy with them.
Here is the table structure and a few rows of sample data:
CREATE TABLE DailySalesTotals (
customer_id CHAR(10) NOT NULL,
order_date DATE NOT NULL,
order_amt DECIMAL(8,2) NOT NULL,
PRIMARY KEY (customer_id, order_date)
);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Celko', '1999-11-28', 450.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Curly', '1999-11-25', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Curly', '1999-11-26', 300.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Curly', '1999-11-27', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Curly', '1999-11-28', 450.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Larry', '1999-11-25', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Larry', '1999-11-26', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Larry', '1999-11-27', 450.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Larry', '1999-11-28', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Moe', '1999-11-25', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Moe', '1999-11-26', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Moe', '1999-11-27', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Moe', '1999-11-28', 400.00);
The following is the original solution:
SELECT H1.customer_id, ' dropped purchase amount on ',
MAX(H1.order_date)
FROM DailySalesTotals H1
WHERE H1.order_amt
< (SELECT H2.order_amt
FROM DailySalesTotals H2
WHERE H1.customer_id = H2.customer_id
AND H2.order_date
= (SELECT MAX(order_date)
FROM DailySalesTotals H3
WHERE H1.customer_id = H3.customer_id
AND H1.order_date > H3.order_date
)
)
AND H1.order_date = (SELECT MAX( order_date)
FROM DailySalesTotals h4
WHERE h4.customer_id = H1.customer_id
)
GROUP BY customer_id;
And here is what I came up with:
SELECT
CUSTOMER_ID,
ORDER_DATE
FROM (
WITH RANKING AS (
SELECT
CUSTOMER_ID,
ORDER_DATE,
ORDER_AMT,
RANK() OVER(PARTITION BY CUSTOMER_ID ORDER BY ORDER_DATE DESC) AS DATE_RANK
FROM
DailySalesTotals
) SELECT
NEW.CUSTOMER_ID,
NEW.ORDER_AMT - OLD.ORDER_AMT AS NET,
NEW.ORDER_DATE
FROM
RANKING NEW,
RANKING OLD
WHERE
NEW.CUSTOMER_ID = OLD.CUSTOMER_ID
AND NEW.DATE_RANK = 1
AND OLD.DATE_RANK = 2
)
WHERE
NET < 0;
On a 130,000 rows of data, my query costs 1,895 vs. 361,770 of the original one. I am wondering if it's possible to do it more efficiently.
EDIT: and here is an absolutely beautiful ANSI-compliant query I found here that only costs 703:
SELECT
CUSTOMER_ID, MAX(ORDER_DATE) AS ORDER_DATE
FROM
(
SELECT
CUSTOMER_ID,
ORDER_DATE,
ORDER_AMT *
CASE ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY order_date DESC)
WHEN 1 THEN 1
WHEN 2 THEN -1
END AS total
FROM
DailySalesTotals
) T
GROUP BY
CUSTOMER_ID
HAVING
SUM(total) < 0;
EDIT2: A co-worker pointed out that using LAST/FIRST makes query even faster (cost 615):
SELECT
CUSTOMER_ID,
ORDER_DATE
FROM (
WITH RANKING AS (
SELECT
CUSTOMER_ID,
ORDER_DATE,
ORDER_AMT,
RANK() OVER(PARTITION BY CUSTOMER_ID ORDER BY ORDER_DATE DESC) AS DATE_RANK
FROM
DailySalesTotals
) SELECT
CUSTOMER_ID,
(
MAX (ORDER_AMT) KEEP (DENSE_RANK LAST ORDER BY order_date) -
MIN (ORDER_AMT) KEEP (DENSE_RANK FIRST ORDER BY order_date)
) AS NET,
MAX(ORDER_DATE) AS ORDER_DATE
FROM
RANKING
WHERE
DATE_RANK < = 2
GROUP BY CUSTOMER_ID
)
WHERE
NET < 0;
debugging DB2 PL/SQL
Is such a pain sometimes, because dbms_output module takes a while to setup and work glitchlessly and it’s really hard to do quick debugging when you can’t print anything from your code (in addition, dbms_output doesn’t seem to be available in DB2 Express-C). I ended up doing this:
CREATE TABLE debug
(
event_time TIMESTAMP,
message VARCHAR(1000)
);
--#SET TERMINATOR #
CREATE PROCEDURE print(message VARCHAR(1000))
LANGUAGE SQL
begin
INSERT INTO debug VALUES(current timestamp, message);
end
Then I can just use:
call print('Test!');
from anywhere in my code and see the debug messages using:
SELECT * FROM debug;