The Data Warehouse Problem
A post here offers to solve the following problem:
This one came in as a data warehouse problem in 1999. You have a history table of customer daily total purchases. The problem is to report just those customers who decreased their purchase amounts on their most recent order placed with us. We are trying to get an idea when people are saturated with whatever we are selling. If their order level is holding steady we are happy with them.
Here is the table structure and a few rows of sample data:
CREATE TABLE DailySalesTotals (
customer_id CHAR(10) NOT NULL,
order_date DATE NOT NULL,
order_amt DECIMAL(8,2) NOT NULL,
PRIMARY KEY (customer_id, order_date)
);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Celko', '1999-11-28', 450.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Curly', '1999-11-25', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Curly', '1999-11-26', 300.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Curly', '1999-11-27', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Curly', '1999-11-28', 450.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Larry', '1999-11-25', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Larry', '1999-11-26', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Larry', '1999-11-27', 450.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Larry', '1999-11-28', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Moe', '1999-11-25', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Moe', '1999-11-26', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Moe', '1999-11-27', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Moe', '1999-11-28', 400.00);
The following is the original solution:
SELECT H1.customer_id, ' dropped purchase amount on ',
MAX(H1.order_date)
FROM DailySalesTotals H1
WHERE H1.order_amt
< (SELECT H2.order_amt
FROM DailySalesTotals H2
WHERE H1.customer_id = H2.customer_id
AND H2.order_date
= (SELECT MAX(order_date)
FROM DailySalesTotals H3
WHERE H1.customer_id = H3.customer_id
AND H1.order_date > H3.order_date
)
)
AND H1.order_date = (SELECT MAX( order_date)
FROM DailySalesTotals h4
WHERE h4.customer_id = H1.customer_id
)
GROUP BY customer_id;
And here is what I came up with:
SELECT
CUSTOMER_ID,
ORDER_DATE
FROM (
WITH RANKING AS (
SELECT
CUSTOMER_ID,
ORDER_DATE,
ORDER_AMT,
RANK() OVER(PARTITION BY CUSTOMER_ID ORDER BY ORDER_DATE DESC) AS DATE_RANK
FROM
DailySalesTotals
) SELECT
NEW.CUSTOMER_ID,
NEW.ORDER_AMT - OLD.ORDER_AMT AS NET,
NEW.ORDER_DATE
FROM
RANKING NEW,
RANKING OLD
WHERE
NEW.CUSTOMER_ID = OLD.CUSTOMER_ID
AND NEW.DATE_RANK = 1
AND OLD.DATE_RANK = 2
)
WHERE
NET < 0;
On a 130,000 rows of data, my query costs 1,895 vs. 361,770 of the original one. I am wondering if it's possible to do it more efficiently.
EDIT: and here is an absolutely beautiful ANSI-compliant query I found here that only costs 703:
SELECT
CUSTOMER_ID, MAX(ORDER_DATE) AS ORDER_DATE
FROM
(
SELECT
CUSTOMER_ID,
ORDER_DATE,
ORDER_AMT *
CASE ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY order_date DESC)
WHEN 1 THEN 1
WHEN 2 THEN -1
END AS total
FROM
DailySalesTotals
) T
GROUP BY
CUSTOMER_ID
HAVING
SUM(total) < 0;
EDIT2: A co-worker pointed out that using LAST/FIRST makes query even faster (cost 615):
SELECT
CUSTOMER_ID,
ORDER_DATE
FROM (
WITH RANKING AS (
SELECT
CUSTOMER_ID,
ORDER_DATE,
ORDER_AMT,
RANK() OVER(PARTITION BY CUSTOMER_ID ORDER BY ORDER_DATE DESC) AS DATE_RANK
FROM
DailySalesTotals
) SELECT
CUSTOMER_ID,
(
MAX (ORDER_AMT) KEEP (DENSE_RANK LAST ORDER BY order_date) -
MIN (ORDER_AMT) KEEP (DENSE_RANK FIRST ORDER BY order_date)
) AS NET,
MAX(ORDER_DATE) AS ORDER_DATE
FROM
RANKING
WHERE
DATE_RANK < = 2
GROUP BY CUSTOMER_ID
)
WHERE
NET < 0;
Permutations, unique sets, long division, and other beasts
I had an interesting problem to solve at work today. The challenge level was high enough to be able to solve it elegantly in a few hours, except, of course, it took me almost a day, because finding a contiguous block of two hours nowadays seems impossible.
Imagine that you need to write a program (application, class, two methods, however you want to call it) that will do these two things:
1) Given a set of integer ranges
(0..1, 0..1, 0..1)
it should be able to sequentially list all permutations, like so:
0 0 0 0 0 1 0 1 0 ... 1 1 1
Essentially, you writing a function that will be able to count in a k-numeral system. Here are a few conditions:
- must accept reasonable, but arbitrary number of ranges on the input,
- ranges can be any pair of integers a and b, where a <= b, for example, -1..0 and -234.. 534 are two valid ranges,
- ranges can obviously differ from each other, for example, if you input (0..1, -1..2), the program should produce something like this:
0 -1 0 0 0 1 0 2 ... 1 2
Seems easy enough, just a lot of little traps that can get in a way.
2) By giving your algorithm a second parameter, an integer k, you should be able to specify the kth number of the permutation sequence, so that when you run your code, it will produce all values of the sequence starting with permutation k + 1. For instance if k=2 for (0..1, 0..1, 0..1) the output will look:
0 1 0 0 1 1 1 0 0 ... 1 1 1
For the second part, I used long division to determine the permutation that will correspond to k. The interesting thing about this part is that essentially you are writing a decimal converter capable of converting a decimal representation of an integer into a k-numeral system.
Bonus marks: extend the program to work with other types, so that you can have mixed sets on inputs (a range of integers and a range of strings).
I think this can be made into a neat programming assignment question, no?