The Data Warehouse Problem

A post here offers to solve the following problem:

This one came in as a data warehouse problem in 1999. You have a history table of customer daily total purchases. The problem is to report just those  customers who decreased their purchase amounts on their most recent order placed with us.  We are trying to get an idea when people are saturated with whatever we are selling.  If their order level is holding steady we are happy with them.

Here is the table structure and a few rows of sample data:

CREATE TABLE DailySalesTotals (
    customer_id CHAR(10) NOT NULL,
    order_date DATE NOT NULL,
    order_amt DECIMAL(8,2) NOT NULL,
    PRIMARY KEY (customer_id, order_date)
);

INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Celko', '1999-11-28', 450.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Curly', '1999-11-25', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Curly', '1999-11-26', 300.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Curly', '1999-11-27', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Curly', '1999-11-28', 450.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Larry', '1999-11-25', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Larry', '1999-11-26', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Larry', '1999-11-27', 450.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Larry', '1999-11-28', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Moe', '1999-11-25', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Moe', '1999-11-26', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Moe', '1999-11-27', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Moe', '1999-11-28', 400.00);

The following is the original solution:

SELECT H1.customer_id, ' dropped purchase amount on ',
MAX(H1.order_date)
  FROM DailySalesTotals H1
 WHERE H1.order_amt
       < (SELECT H2.order_amt
            FROM DailySalesTotals H2
           WHERE H1.customer_id = H2.customer_id
             AND H2.order_date
                 = (SELECT MAX(order_date)
                      FROM DailySalesTotals H3
                     WHERE H1.customer_id = H3.customer_id
                      AND H1.order_date > H3.order_date
                   )
           )
  AND H1.order_date = (SELECT MAX( order_date)
                          FROM DailySalesTotals h4
                         WHERE h4.customer_id = H1.customer_id
                        )
 GROUP BY customer_id;

And here is what I came up with:

SELECT
    CUSTOMER_ID,
    ORDER_DATE
FROM (
    WITH RANKING AS (
        SELECT
            CUSTOMER_ID,
            ORDER_DATE,
            ORDER_AMT,
            RANK() OVER(PARTITION BY CUSTOMER_ID ORDER BY ORDER_DATE DESC) AS DATE_RANK
        FROM
            DailySalesTotals
    ) SELECT
        NEW.CUSTOMER_ID,
        NEW.ORDER_AMT - OLD.ORDER_AMT AS NET,
        NEW.ORDER_DATE
    FROM
        RANKING NEW,
        RANKING OLD
    WHERE
        NEW.CUSTOMER_ID = OLD.CUSTOMER_ID
        AND NEW.DATE_RANK = 1
        AND OLD.DATE_RANK = 2
)
WHERE
    NET < 0;

On a 130,000 rows of data, my query costs 1,895 vs. 361,770 of the original one. I am wondering if it's possible to do it more efficiently.

EDIT: and here is an absolutely beautiful ANSI-compliant query I found here that only costs 703:

SELECT
    CUSTOMER_ID, MAX(ORDER_DATE) AS ORDER_DATE
FROM
    (
        SELECT
            CUSTOMER_ID,
            ORDER_DATE,
            ORDER_AMT *
                CASE ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY order_date DESC)
                WHEN 1 THEN 1
                WHEN 2 THEN -1
                END AS total
        FROM
            DailySalesTotals
     )  T
GROUP BY
    CUSTOMER_ID
HAVING
    SUM(total) < 0;

EDIT2: A co-worker pointed out that using LAST/FIRST makes query even faster (cost 615):

SELECT
    CUSTOMER_ID,
    ORDER_DATE
FROM (
    WITH RANKING AS (
        SELECT
            CUSTOMER_ID,
            ORDER_DATE,
            ORDER_AMT,
            RANK() OVER(PARTITION BY CUSTOMER_ID ORDER BY ORDER_DATE DESC) AS DATE_RANK
        FROM
           DailySalesTotals
    ) SELECT
        CUSTOMER_ID,
        (
            MAX (ORDER_AMT) KEEP (DENSE_RANK LAST ORDER BY order_date) -
            MIN (ORDER_AMT) KEEP (DENSE_RANK FIRST ORDER BY order_date)
        ) AS NET,
        MAX(ORDER_DATE) AS ORDER_DATE
    FROM
        RANKING
    WHERE
        DATE_RANK < = 2
    GROUP BY CUSTOMER_ID
)
WHERE
    NET < 0;

debugging DB2 PL/SQL

Is such a pain sometimes, because dbms_output module takes a while to setup and work glitchlessly and it’s really hard to do quick debugging when you can’t print anything from your code (in addition, dbms_output doesn’t seem to be available in DB2 Express-C). I ended up doing this:


CREATE TABLE debug
(
    event_time   TIMESTAMP,
    message      VARCHAR(1000)
);

--#SET TERMINATOR #
CREATE PROCEDURE print(message VARCHAR(1000))
LANGUAGE SQL
begin
    INSERT INTO debug VALUES(current timestamp, message);
end

Then I can just use:

call print('Test!');

from anywhere in my code and see the debug messages using:

SELECT * FROM debug;