The Data Warehouse Problem

A post here offers to solve the following problem:

This one came in as a data warehouse problem in 1999. You have a history table of customer daily total purchases. The problem is to report just those  customers who decreased their purchase amounts on their most recent order placed with us.  We are trying to get an idea when people are saturated with whatever we are selling.  If their order level is holding steady we are happy with them.

Here is the table structure and a few rows of sample data:

CREATE TABLE DailySalesTotals (
    customer_id CHAR(10) NOT NULL,
    order_date DATE NOT NULL,
    order_amt DECIMAL(8,2) NOT NULL,
    PRIMARY KEY (customer_id, order_date)
);

INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Celko', '1999-11-28', 450.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Curly', '1999-11-25', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Curly', '1999-11-26', 300.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Curly', '1999-11-27', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Curly', '1999-11-28', 450.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Larry', '1999-11-25', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Larry', '1999-11-26', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Larry', '1999-11-27', 450.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Larry', '1999-11-28', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Moe', '1999-11-25', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Moe', '1999-11-26', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Moe', '1999-11-27', 400.00);
INSERT INTO DailySalesTotals (customer_id, order_date, order_amt) VALUES ('Moe', '1999-11-28', 400.00);

The following is the original solution:

SELECT H1.customer_id, ' dropped purchase amount on ',
MAX(H1.order_date)
  FROM DailySalesTotals H1
 WHERE H1.order_amt
       < (SELECT H2.order_amt
            FROM DailySalesTotals H2
           WHERE H1.customer_id = H2.customer_id
             AND H2.order_date
                 = (SELECT MAX(order_date)
                      FROM DailySalesTotals H3
                     WHERE H1.customer_id = H3.customer_id
                      AND H1.order_date > H3.order_date
                   )
           )
  AND H1.order_date = (SELECT MAX( order_date)
                          FROM DailySalesTotals h4
                         WHERE h4.customer_id = H1.customer_id
                        )
 GROUP BY customer_id;

And here is what I came up with:

SELECT
    CUSTOMER_ID,
    ORDER_DATE
FROM (
    WITH RANKING AS (
        SELECT
            CUSTOMER_ID,
            ORDER_DATE,
            ORDER_AMT,
            RANK() OVER(PARTITION BY CUSTOMER_ID ORDER BY ORDER_DATE DESC) AS DATE_RANK
        FROM
            DailySalesTotals
    ) SELECT
        NEW.CUSTOMER_ID,
        NEW.ORDER_AMT - OLD.ORDER_AMT AS NET,
        NEW.ORDER_DATE
    FROM
        RANKING NEW,
        RANKING OLD
    WHERE
        NEW.CUSTOMER_ID = OLD.CUSTOMER_ID
        AND NEW.DATE_RANK = 1
        AND OLD.DATE_RANK = 2
)
WHERE
    NET < 0;

On a 130,000 rows of data, my query costs 1,895 vs. 361,770 of the original one. I am wondering if it's possible to do it more efficiently.

EDIT: and here is an absolutely beautiful ANSI-compliant query I found here that only costs 703:

SELECT
    CUSTOMER_ID, MAX(ORDER_DATE) AS ORDER_DATE
FROM
    (
        SELECT
            CUSTOMER_ID,
            ORDER_DATE,
            ORDER_AMT *
                CASE ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY order_date DESC)
                WHEN 1 THEN 1
                WHEN 2 THEN -1
                END AS total
        FROM
            DailySalesTotals
     )  T
GROUP BY
    CUSTOMER_ID
HAVING
    SUM(total) < 0;

EDIT2: A co-worker pointed out that using LAST/FIRST makes query even faster (cost 615):

SELECT
    CUSTOMER_ID,
    ORDER_DATE
FROM (
    WITH RANKING AS (
        SELECT
            CUSTOMER_ID,
            ORDER_DATE,
            ORDER_AMT,
            RANK() OVER(PARTITION BY CUSTOMER_ID ORDER BY ORDER_DATE DESC) AS DATE_RANK
        FROM
           DailySalesTotals
    ) SELECT
        CUSTOMER_ID,
        (
            MAX (ORDER_AMT) KEEP (DENSE_RANK LAST ORDER BY order_date) -
            MIN (ORDER_AMT) KEEP (DENSE_RANK FIRST ORDER BY order_date)
        ) AS NET,
        MAX(ORDER_DATE) AS ORDER_DATE
    FROM
        RANKING
    WHERE
        DATE_RANK < = 2
    GROUP BY CUSTOMER_ID
)
WHERE
    NET < 0;

Heap, heap hooray!

“Tanks are not afraid of mud” – an ironic Russian expression figuratively meaning that some people are not scared, or even sometimes prefer to choose a long way over a short one, for no apparent reason. Sometimes I like to do that too.

At some point, yesterday night, I’ve decided that the best way to approach my data structures’ course assignment question on binary heaps (insert a list of elements in the heap, draw a list of diagrams showing the resulting tree after every insertion) would be spending half an hour writing a Python script that would do something similar :) So I looked at Graphviz (which by the way, is awesome – you can use it for describing complex diagrams in plain text and then generating images from it) and came up with a half-completed binary heap structure (I only needed insertion) and a script that will generate digraphs after every insertion, both of them  here. Nothing special at all -  but I like how I can now continue typing up my assignment in LyX, or maybe I just wanted an excuse to play with Python (:

These are how the pictures look for [10, 5, 12, 3, 2, 1, 8, 7, 9, 4]:

 

 

How to compare CLOBs in DB2 with a UDF

Although sometimes it might be less than desirable to do CLOB comparison in SQL statements, suppose you need to write a SELECT that compares two CLOB fields in the where clause:

SELECT COUNT(*) FROM tableA, tableB WHERE tableA.clob = tableB.clob

Seems easy enough, except when you try to run it, DB2 will make a sad face:

SQL0401N The data types of the operands for the operation “=” are not compatible or comparable. SQLSTATE=42818

A few solutions come to mind:

  • If you are writing an external application, you can do comparison in the application code – retrieve CLOBs, pass them to a method, get a result, carry on. This is kinda inefficient for remote applications (network overhead) plus not elegant, considering the extra logic you need to put in the application;
  • You can use an additional column in both of the tables to store corresponding CLOB’s hash, update hash whenever CLOB value has changed, and use the value of the hash field in the comparison operations. This approach will obviously fail if you rely on an exact match since hash values may collide;
  • You can use DBMS_LOB module that provides a COMPARE function. With this function, the query in question becomes:
    "SELECT COUNT(*) from tableA, tableB WHERE DBMS_LOB.COMPARE(tableA.clob, tableB.clob) = 0"

    This would be my solution of choice if not for the fact that it’s only available as of DB2 9.7;

  • Write a UDF that will do comparison for you and will essentially replace the missing COMPARE function from DBMS_LOB module.

I had to deal with DB2 9.5 and didn’t have much options, so I decided to go with a UDF. Here’s how it worked out:

1. First, let’s lay down a few rules. In my case, two CLOBs were considered equal if:

  • neither of the CLOBs was NULL;
  • sizes of the two CLOBs were equal;
  • pairs of characters at the corresponding positions in both CLOBs matched.

Further more, the function was to return 1 if two CLOBs were equal and 0 in all other cases.

2. Here is a Java class with a single method that solves the problem:

import java.io.IOException;
import java.util.Arrays;
import COM.ibm.db2.app.Clob;
import COM.ibm.db2.app.UDF;

public class ClobCompare extends UDF {

    /** Returns 1 if the both CLOBs are the equal, and 0 otherwise. */
    public void clobsEqual(COM.ibm.db2.app.Clob clobA, COM.ibm.db2.app.Clob clobB, int result) throws Exception{

        if (clobA == null || clobB == null){
            set(3, 0);
            return;
        }

       char[] clobCharArrayA = null;
       char[] clobCharArrayB = null;

       // Allocate space for CLOB arrays.
       try {
           clobCharArrayA = new char[(new Long(clobA.size())).intValue()];
           clobCharArrayB = new char[(new Long(clobB.size())).intValue()];
       } catch (IOException e){
           System.out.println(e.getMessage());
       }

       // Read CLOBs.
       try {
          clobA.getReader().read(clobCharArrayA);
          clobB.getReader().read(clobCharArrayB);
       } catch (IOException e){
           System.out.println(e.getMessage());
       }

       // Compare.
       if (Arrays.equals(clobCharArrayA, clobCharArrayB)){
          set(3, 1);
       } else {
          set(3, 0);
       }
    }
}

By the way, since it’s a UDF, you could handle exceptions better (namely, by signaling SQLSTATE) although I did not do it for the sake of keeping this example short.

3. Now we need to compile the code and drop the class file somewhere DB2 will be able to find it:

/opt/ibm/db2/V9.5/java/jdk64/bin/javac ClobCompare.java
cp ClobCompare.class /home/db2inst1/sqllib/function/

4. Almost there, just need to create a function within the specific database:

CONNECT DB TEST;

CREATE FUNCTION clobsEqual(a CLOB(1048576), b CLOB(1048576))
RETURNS INTEGER
LANGUAGE Java
EXTERNAL NAME 'ClobCompare!clobsEqual'
FENCED
THREADSAFE
NO SQL
NOT NULL CALL
NO EXTERNAL ACTION
DISALLOW PARALLEL
PARAMETER STYLE db2general;

5. Let’s check if it works:

db2 "values clobsEqual(clob('abc'), clob('abc'))"

If you did everything right, you will see a ’1′ returned. If DB2 dies, make sure you put class in the right folder and check the names of the class / method if you did any changes. Also, if you can’t find the problem right away, try looking in db2diag.log for more info (located in ~/sqllib/db2dump by default).

5. As a side note, if you recompile your code, you do not need to recreate the function. To make sure DB2 uses the latest compiled code available it is sufficient to do this:

db2 "call sqlj.refresh_classes()"

while connected to the database.

Assuming we’ve completed all of the above steps, we can now rewrite our query as:

"SELECT COUNT(*) from tableA, tableB equalsClob(tableA.clob, tableB.clob) = 1"

Enjoy!

Lotus Notes and Python: making them work together

I have recently came across a few tasks related to extracting some data from Lotus Notes databases. Looking up things on the Internet was a bit harder than usual – there were examples, but usually only listing a few solutions to very specific tasks, so I had to dig quite a bit to get enough pieces to be able to build my scripts. Here are a few pointers if you are facing the same challenge.

0. Get your setup right

There are two different ways to access Domino server from Python – you can choose between ODBC or COM interface. I tried both, and the first one might seem slightly easier, but in my case Notes ODBC driver was having some major problems with timestamps, dates and unicode, so I opted out for using the COM interface, instead.

First thing you need to do is to compile pywin32 extensions for python. You will need Microsoft Visual Studio 2008 (the most recent one failed to compile), and Windows SDK (for a message compiler utility mc.exe that is not part of the MVSE 2008 package).

1. Learn the interface

One of the biggest challenges of Notes programming through COM was that I had not been able to find a proper reference that would list objects, methods, and fields I can access through COM. After poking around, I found notes32.tlb, which is included with every Notes installation. This file is a Type Library that actually contains everything you need to know about the Notes interface. You can open it in MVSE, and it’s going to look something like this:

Classes are on the left, object methods and fields are listed on the right. MVSE will also show the method signatures. I agree, it’s not the best documentation, but it’s something. I also found that when you know a particular method name, it’s easier to google for samples of code containing it.

2. Start exploring

Connect to the Notes database (local):

import win32com.client
...
notesServer     = ''
notesFile	= 'C:\\my_cool_db.nsf'
notesPass	= 'abcdef123'
# Connect to notes database (returns NotesDatabase object)
notesSession = win32com.client.Dispatch('Lotus.NotesSession')
notesSession.Initialize(notesPass)
notesDatabase = notesSession.GetDatabase(notesServer, notesFile)

Connect to the Notes database (remote):

notesServer     = 'Full/Server/Name'
notesFile	= 'server\relative\path\my_cool_db.nsf'
notesPass	= 'abcdef123'

# Connect to notes database (returns NotesDatabase object)
notesSession = win32com.client.Dispatch('Lotus.NotesSession')
notesSession.Initialize(notesPass)
notesDatabase = notesSession.GetDatabase(notesServer, notesFile)

By the by, if you leave password blank, you will be prompted for it at runtime.

Iterate over all documents in the database:

doc = docs.GetFirstDocument()

while (doc is not None):
        print doc
        doc = docs.GetNextDocument(doc)

Look up NotesDocument in Type Library file to learn what else you can do once you have the document.

Handling Unicode
When you get a document item (field) value, it will return a tuple, often containing one value. To be able to work with that value, you might need to UTF-8 encode it, like this:

doc.GetItemValue('FieldName')[0].encode("utf-8")

Hope this helps, enjoy!

Permutations, unique sets, long division, and other beasts

I had an interesting problem to solve at work today. The challenge level was high enough to be able to solve it elegantly in a few hours, except, of course, it took me almost a day, because finding a contiguous block of two hours nowadays seems impossible.

Imagine that you need to write a program (application, class, two methods, however you want to call it) that will do these two things:

1) Given a set of integer ranges

(0..1, 0..1, 0..1)

it should be able to sequentially list all permutations, like so:

0 0 0
0 0 1
0 1 0
...
1 1 1

Essentially, you writing a function that will be able to count in a k-numeral system. Here are a few conditions:
- must accept reasonable, but arbitrary number of ranges on the input,
- ranges can be any pair of integers a and b, where a <= b, for example, -1..0 and -234.. 534 are two valid ranges,
- ranges can obviously differ from each other, for example, if you input (0..1, -1..2), the program should produce something like this:

0 -1
0 0
0 1
0 2
...
1 2

Seems easy enough, just a lot of little traps that can get in a way.

2) By giving your algorithm a second parameter, an integer k, you should be able to specify the kth number of the permutation sequence, so that when you run your code, it will produce all values of the sequence starting with permutation k + 1. For instance if k=2 for (0..1, 0..1, 0..1) the output will look:

0 1 0
0 1 1
1 0 0
...
1 1 1

For the second part, I used long division to determine the permutation that will correspond to k. The interesting thing about this part is that essentially you are writing a decimal converter capable of converting a decimal representation of an integer into a k-numeral system.

Bonus marks: extend the program to work with other types, so that you can have mixed sets on inputs (a range of integers and a range of strings).

I think this can be made into a neat programming assignment question, no?

Dropping all objects in a schema, including the schema itself in DB2

I was recently on a quest of finding a painless solution for dropping all objects from a specific schema and schema itself in a DB2 database. Unlike many other DBMS that require a custom procedure to do that DB2 graciously provides a procedure called ADMIN_DROP_SCHEMA:

>>-ADMIN_DROP_SCHEMA--(--schema--,--dropmode--,----------------->
>--errortabschema--,--errortab--)------------------------------><

The first parameter (must be capitalized) is the name of the schema to drop. Dropmode is reserved for future use and should be NULL. Errortabschema and Errortab are the error schema and table used by the procedure to store error information about objects it can’t drop. You must make sure that this table does not exist in the schema before you call the procedure. Procedure will fail if this table exists. In other words, if you need to drop multiple schemas, you will need to have a DROP TABLE errortabschema.errortab statement before each call.

Here is an example of code that will drop all supported objects (see DB2 InfoCentre link) in MYSCHEMA schema:

DROP TABLE ERRORSCHEMA.ERRORTABLE;
call ADMIN_DROP_SCHEMA('MYSCHEMA', NULL, 'ERRORSCHEMA', 'ERRORTABLE');

NB! This will work well for a setup when objects in other database schemas do not depend on objects on a schema to be dropped. If that’s the case, your best bet is probably dropping the database entirely. You could also try some of the scripts provided in this tutorial by Serge Rielau.

What do you use to validate your XML/DTD documents?

I am ashamed to say it out loud, but I usually use Firefox to check that the structure of an XML file is correct and all tags are matching. It is kinda inconvenient when you have to validate an XML document from the command line, and it’s even more inconvenient if you need to validate an XML document against a DTD.

Recently, I found out about a nifty utility called xmlstarlet that lets you, among other things, validate XML documents (including validation against a DTD). If you are on Ubuntu, you can get it by typing this into your terminal:

sudo apt-get install xmlstarlet

debugging DB2 PL/SQL

Is such a pain sometimes, because dbms_output module takes a while to setup and work glitchlessly and it’s really hard to do quick debugging when you can’t print anything from your code (in addition, dbms_output doesn’t seem to be available in DB2 Express-C). I ended up doing this:


CREATE TABLE debug
(
    event_time   TIMESTAMP,
    message      VARCHAR(1000)
);

--#SET TERMINATOR #
CREATE PROCEDURE print(message VARCHAR(1000))
LANGUAGE SQL
begin
    INSERT INTO debug VALUES(current timestamp, message);
end

Then I can just use:

call print('Test!');

from anywhere in my code and see the debug messages using:

SELECT * FROM debug;

catching last post_syncdb signal in Django

I was writing a Django application with a large back-end database with a lot of data that needed to be preloaded. I wanted a way to have a bunch of *.json files loaded sometime after syncdb finished executing. The kick was that I actually wanted to have them loaded after syncdb finishes executing (i.e. all individual models had been processed).

Django has a post_syncdb signal which fires when an individual model for one of the application is installed which is not exactly what I wanted, since I only wished my got getting executed once after all of the models had been installed. So here’s what I’ve come up with: this code basically checks if the current callback is for the last app in your INSTALLED_APPS list, and if so it executes some fixture loading code.

from django.db.models.signals import post_syncdb
import diacre.settings as settings
def load_data(sender, **kwargs):
    if kwargs['app'].__name__ == settings.INSTALLED_APPS[-1] + ".models":
        pass
        # load fixtures here

post_syncdb.connect(load_data)

Do You Speak Database? Then We Need You!

Greg Wilson writes:

Zuzel Vera Pacheco, one of my current graduate students, got ethics approval today for her research study. From her blog post:

Want to win a $100 Best Buy gift card? Do you have basic knowledge about database queries? If so, I need you!

Subjects are needed to take part in a study concerning the visualization of database queries. Participants will be asked to draw diagrams that represent the execution of database queries or to determine what queries are represented by a set of diagrams. This study will help design a tool intended to help expert and novice programmers to design and debug such queries. The time needed for the study will range from 30 minutes to an hour, and can take place in the Bahen Centre at the University of Toronto or elsewhere in the Greater Toronto Area.

A basic understanding of relational databases and database queries is required. The examples will contain queries in SQL and other programming languages like Ruby or Python. The participants should be fluent/conversant in English.

Participants who complete the study will be entered into a random draw for a $100 Best Buy gift card. The odds of winning this prize are 1 in 30.

For more information please contact:

Zuzel Vera Pacheco
zuzelvp@cs.toronto.edu

If you can help, please give her a shout—we could all learn something cool.

Post to Twitter Tweet This Post

Two reason you should do it:

1) Knowing Zuzel as a hard working person, the results of the studies are going to be interesting – contribute to CS research!

2) C’mon it’s a $100 BestBuy card :D