The Data Warehouse Problem

A post here offers to solve the following problem:

This one came in as a data warehouse problem in 1999. You have a history table of customer daily total purchases. The problem is to report just those  customers who decreased their purchase amounts on their most recent order placed with us.  We are trying to get an idea when people are saturated with whatever we are selling.  If their order level is holding steady we are happy with them.

Here is the table structure and a few rows of sample data:

CREATE TABLE DailySalesTotals (
    customer_id CHAR(10) NOT NULL,
    order_date DATE NOT NULL,
    order_amt DECIMAL(8,2) NOT NULL,
    PRIMARY KEY (customer_id, order_date)
);

INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Celko', '1999-11-28', 450.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Curly', '1999-11-25', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Curly', '1999-11-26', 300.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Curly', '1999-11-27', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Curly', '1999-11-28', 450.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Larry', '1999-11-25', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Larry', '1999-11-26', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Larry', '1999-11-27', 450.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Larry', '1999-11-28', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Moe', '1999-11-25', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Moe', '1999-11-26', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Moe', '1999-11-27', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Moe', '1999-11-28', 400.00);

The following is the original solution:

SELECT H1.customer_id, ' dropped purchase amount on ',
MAX(H1.order_date)
  FROM DailySalesTotals H1
 WHERE H1.order_amt
       < (SELECT H2.order_amt
            FROM DailySalesTotals H2
           WHERE H1.customer_id = H2.customer_id
             AND H2.order_date
                 = (SELECT MAX(order_date)
                      FROM DailySalesTotals H3
                     WHERE H1.customer_id = H3.customer_id
                      AND H1.order_date > H3.order_date
                   )
           )
  AND H1.order_date = (SELECT MAX( order_date)
                          FROM DailySalesTotals h4
                         WHERE h4.customer_id = H1.customer_id
                        )
 GROUP BY customer_id;

And here is what I came up with:

SELECT
    CUSTOMER_ID,
    ORDER_DATE
FROM (
    WITH RANKING AS (
        SELECT
            CUSTOMER_ID,
            ORDER_DATE,
            ORDER_AMT,
            RANK() OVER(PARTITION BY CUSTOMER_ID ORDER BY ORDER_DATE DESC) AS DATE_RANK
        FROM
            DailySalesTotals
    ) SELECT
        NEW.CUSTOMER_ID,
        NEW.ORDER_AMT - OLD.ORDER_AMT AS NET,
        NEW.ORDER_DATE
    FROM
        RANKING NEW,
        RANKING OLD
    WHERE
        NEW.CUSTOMER_ID = OLD.CUSTOMER_ID
        AND NEW.DATE_RANK = 1
        AND OLD.DATE_RANK = 2
)
WHERE
    NET < 0;

On a 130,000 rows of data, my query costs 1,895 vs. 361,770 of the original one. I am wondering if it's possible to do it more efficiently.

EDIT: and here is an absolutely beautiful ANSI-compliant query I found here that only costs 703:

SELECT
    CUSTOMER_ID, MAX(ORDER_DATE) AS ORDER_DATE
FROM
    (
        SELECT
            CUSTOMER_ID,
            ORDER_DATE,
            ORDER_AMT *
                CASE ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY order_date DESC)
                WHEN 1 THEN 1
                WHEN 2 THEN -1
                END AS total
        FROM
            DailySalesTotals
     )  T
GROUP BY
    CUSTOMER_ID
HAVING
    SUM(total) < 0;

EDIT2: A co-worker pointed out that using LAST/FIRST makes query even faster (cost 615):

SELECT
    CUSTOMER_ID,
    ORDER_DATE
FROM (
    WITH RANKING AS (
        SELECT
            CUSTOMER_ID,
            ORDER_DATE,
            ORDER_AMT,
            RANK() OVER(PARTITION BY CUSTOMER_ID ORDER BY ORDER_DATE DESC) AS DATE_RANK
        FROM
           DailySalesTotals
    ) SELECT
        CUSTOMER_ID,
        (
            MAX (ORDER_AMT) KEEP (DENSE_RANK LAST ORDER BY order_date) -
            MIN (ORDER_AMT) KEEP (DENSE_RANK FIRST ORDER BY order_date)
        ) AS NET,
        MAX(ORDER_DATE) AS ORDER_DATE
    FROM
        RANKING
    WHERE
        DATE_RANK < = 2
    GROUP BY CUSTOMER_ID
)
WHERE
    NET < 0;

Permutations, unique sets, long division, and other beasts

I had an interesting problem to solve at work today. The challenge level was high enough to be able to solve it elegantly in a few hours, except, of course, it took me almost a day, because finding a contiguous block of two hours nowadays seems impossible.

Imagine that you need to write a program (application, class, two methods, however you want to call it) that will do these two things:

1) Given a set of integer ranges

(0..1, 0..1, 0..1)

it should be able to sequentially list all permutations, like so:

0 0 0
0 0 1
0 1 0
...
1 1 1

Essentially, you writing a function that will be able to count in a k-numeral system. Here are a few conditions:
- must accept reasonable, but arbitrary number of ranges on the input,
- ranges can be any pair of integers a and b, where a <= b, for example, -1..0 and -234.. 534 are two valid ranges,
- ranges can obviously differ from each other, for example, if you input (0..1, -1..2), the program should produce something like this:

0 -1
0 0
0 1
0 2
...
1 2

Seems easy enough, just a lot of little traps that can get in a way.

2) By giving your algorithm a second parameter, an integer k, you should be able to specify the kth number of the permutation sequence, so that when you run your code, it will produce all values of the sequence starting with permutation k + 1. For instance if k=2 for (0..1, 0..1, 0..1) the output will look:

0 1 0
0 1 1
1 0 0
...
1 1 1

For the second part, I used long division to determine the permutation that will correspond to k. The interesting thing about this part is that essentially you are writing a decimal converter capable of converting a decimal representation of an integer into a k-numeral system.

Bonus marks: extend the program to work with other types, so that you can have mixed sets on inputs (a range of integers and a range of strings).

I think this can be made into a neat programming assignment question, no?