Computing SQL Aggregates

Some statements:

Statement 1. The result set from a query with more GROUP BY columns provides an answer for a query with less GROUP BY columns.

The result set from the query:

SELECT salesman, state, SUM(sales)
FROM company.sales
GROUP BY salesman, state;

provides an answer for the query:

SELECT salesman, SUM(sales)
FROM company.sales
GROUP BY salesman

Statement 2. The result set from a query with a more liberal WHERE clause provides an answer to a query with a more restrictive WHERE clause.

The result from a query like:

SELECT date, salesman, SUM(sales)
FROM company.sales
GROUP BY date, salesman
WHERE date > '2003-01-01'

provides an answer for the following query:

SELECT date, salesman, SUM(sales)
FROM company.sales
GROUP BY date, salesman
WHERE date > '2003-01-01' AND date > '2003-06-01'

Statement 3. The result of a query with more participating tables provides an answer for a query with less participating tables.

The query:

SELECT salesman, gender, SUM(sales)
FROM company.sales
INNER JOIN salesman USING (salesman_id)
GROUP BY salesman, gender

provides an answer for:

SELECT salesman, SUM(sales)
FROM company.sales
GROUP BY salesman

Statement 4. The result of a query with more aggregate functions provides an answer to every query that contains a subset of those aggregate functions.

SELECT SUM(quantity), SUM(price)
FROM company.sales
GROUP BY salesman

provides an answer for:

SELECT SUM(quanity)
FROM company.sales
GROUP BY salesman

Conclusion: A caching system can be built that caches result sets in order to answer future queries that are sufficiently similar.

Implementation:

1. We take an initial query that we will cache:

SELECT salesman, state, COUNT(*)
FROM company.sales
GROUP BY salesman, state;

2. We execute the query, cache the result and return it to the user:

CREATE TABLE cache.1234567
SELECT salesman, COUNT(*)
FROM company.sales
GROUP BY salesman, state;
SELECT * FROM cache.1234567;

3. We get a follow-up query:

SELECT state, COUNT(*)
FROM sales
GROUP BY state;

4. We use the cache to answer the query:

SELECT state, SUM(`COUNT(*)`) AS `COUNT(*)`
FROM cache.1234567
GROUP BY state

COUNT(*) is a valid column name in cache.1234567.

Cache implementation:

1. For each cached query, we mantain:

2. A subsequent query results in a cache hit if:

Feasibility:

This system will be possible under the following assumptions. They are easily satisfied if the database client behaves in a certain way and will also occur naturally in pivot tables and other data mininig applications. The net result is that the cache will be able to answer a broader range of queries, making it feasible.

Condition 1. A friendly database client will submit queries that allow cache-matching WHERE statement to be identified correctly.

SELECT COUNT(*)
FROM company.sales
WHERE date > CURDATE()

will be used to answer

SELECT COUNT(*)
FROM company.sales
WHERE date > CURDATE() AND date > NOW()

even though the latter contains a redundant expression, it will help in producing a cache hit.

Condition 2. The cache will bring in more aggregate functions than requested

SELECT SUM(price)
FROM company.sales
GROUP BY state

will be silently rewritten into:

SELECT SUM(price), MIN(price), MAX(price), AVG(price)
FROM company.sales
GROUP BY state

and all 4 columns will be cached. This comes from the assumption that the table will be scanned anyway to produce the SUM(), it does not hurt to calculate a few more values in the process.

Condition 3. The cache will drill down more than requested

Given a knowledge in the hierarchy of aggregations, a query:

SELECT SUM(price)
FROM company.sales
GROUP BY country

will be transformed into:

SELECT SUM(price)
FROM company.sales
GROUP BY country, state

The result set from the updated query will be able to answer both the original one, as well as any futher query resulting from a drill-down in the dimension of the original query.

Condition 4. The cache will bring into the query more tables than originally requested

Given enough information about many-to-one relationships, for the query

SELECT salesman_id, SUM(price)
FROM company.sales
GROUP BY salesman_id

the cache will execute and cache

SELECT salesman_id, salesmen.sex, salesmen.age, SUM(sales.price)
FROM company.salesmen
LEFT JOIN company.sales
USING (salesman_id)
GROUP BY salesman_id

Which will allow the following query to be answered from the cache:

SELECT age, COUNT(*)
FROM company.salesmen
GROUP BY age