Using the right data type for some calculation sounds like some obvious advice.

There are many blogs about using temporal data types for temporal data, instead of strings. An obvious reason is data integrity and correctness. We don’t gain much in storing dates as `2019-09-10`

in one record, and as `Nov 10, 2019`

in the next one. Performance is also important in this case – for example, because of statistics being more optimal (an often overlooked side effect of unfit data types).

But at a customer site, I’ve recently discovered a surprising (not surprising in hindsight) performance issue when using NUMBER instead of BINARY_DOUBLE in an Oracle database.

### NUMBER for monetary amounts

The `NUMBER`

type (or `DECIMAL`

, `NUMERIC`

in other RDBMS) is perfectly suited for all decimal numbers, which require the correct precision and rounding. I mean, if you ever encounter this kind of display in an invoice, such as when I purchase video games on steam:

… your trust level for the platform immediately goes down by factor `1.0e+1`

(even if technically, it does not matter in this case). So, the default, e.g. in banking systems for monetary amounts is always to use `NUMBER`

or an equivalent type, e.g. `java.math.BigDecimal`

.

### Calculations on numbers

But now, let’s assume we want to do some statistics on these amounts. Some aggregations of various numeric values, maybe even of their logarithms. We can establish that these two expressions are equal:

ln(a * b) = ln(a) + ln(b)

Or in other words, for positive numbers:

a * b = exp(ln(a) + ln(b))

We’ve already blogged about how this approach can be used to emulate a `PRODUCT()`

aggregate function in SQL, which is *very* useful for some cases, but none of the SQL databases jOOQ supports has built-in support for this yet. Notice, the blog post also takes care of zero and negative numbers.

### But which number type to choose?

Now, we might be tempted to just calculate the `LN(SOME_NUMBER)`

value, and sum that up using `SUM(LN(SOME_NUMBER))`

for this use-case. This turns out to be terribly slow in Oracle. We were thinking of bad indexes, first, even doubted aggregation in general, until I suggested we try using `BINARY_DOUBLE`

instead, in this case. In our case, we didn’t care about the exact numeric value of the amount. A IEEE 754 floating point number with double precision was going to be good enough.

The results were surprising. In a simple benchmark, we compared 10 approaches to calculating this sum of logarithms:

- Using
`NUMBER(20, 10)`

and`SUM(LN(the_number))`

- Using
`NUMBER(20, 10)`

and`SUM(LN(CAST(the_number AS BINARY_DOUBLE)))`

- Using
`NUMBER(20, 10)`

and`SUM(LN(TO_BINARY_DOUBLE(the_number)))`

- Using
`NUMBER(20, 10)`

, containing a pre-calculated LN value - Using
`DOUBLE PRECISION`

and`SUM(LN(the_double))`

- Using
`DOUBLE PRECISION`

and`SUM(LN(CAST(the_double AS BINARY_DOUBLE)))`

- Using
`DOUBLE PRECISION`

and`SUM(LN(TO_BINARY_DOUBLE(the_double)))`

- Using
`DOUBLE PRECISION`

, containing a pre-calculated LN value - Using
`BINARY_DOUBLE`

and`SUM(LN(the_binary))`

- Using
`BINARY_DOUBLE`

, containing a pre-calculated LN value

These were the thoughts:

- We tried the above 3 possible numeric data types, expecting
`BINARY_DOUBLE`

to be the fastest - We tried to pre-calculate the
`LN()`

value for the benchmark, to see how much effort goes into summing, and how much effort goes into the`LN()`

calculation with each type. While in general, in this system, such precalculation is impractical, we still wanted to have a benchmark comparison, in case a materialised view or other technique would be feasible. - We tried casting and converting each type to
`BINARY_DOUBLE`

prior to passing the value to the`LN()`

function. Instead of migrating all the data (with possible side effects), we wanted to see if we can solve this to a reasonable extent “on the fly”

The benchmark that I’m posting here uses this table and example data (full benchmark code at the end of the article):

CREATE TABLE data ( n1 NUMBER(20, 10), n2 NUMBER(20, 10), d1 DOUBLE PRECISION, d2 DOUBLE PRECISION, b1 BINARY_DOUBLE, b2 BINARY_DOUBLE ); INSERT INTO data SELECT level, ln(level), level, ln(level), level, ln(level) FROM dual CONNECT BY level <= 100000;

So, we have 100000 records, whose `SUM(LN(x))`

we want to calculate in the above 10 different ways. `N1`

contains the raw numeric value, and `N2`

contains the pre-calculated `LN(N1)`

.

The benchmark technique is described here. Do note it has a lot of caveats and is only useful for a limited number of verifications. Please always be careful when running such benchmarks, they often do not test production-like situations –

and never use such benchmarks to compare different RDBMS products directly. They are only useful to compare two approaches on the.same RDBMS product

The results as run on Oracle 18c XE in Docker on development hardware are below. Times are compared relative to the fastest run, as actual time spent in each execution is not interesting for comparison. We did have similar results on production-like hardware, though, and I’m sure you can produce similar results in other RDBMS:

NUMBER(20, 10) ------------------------------------------------- Run 3, Statement 1 : 280.34143 (avg : 280.75347) Run 3, Statement 2 : 7.99402 (avg : 8.03506) Run 3, Statement 3 : 7.71383 (avg : 7.73017) Run 3, Statement 4 : 1.05456 (avg : 1.11735) DOUBLE PRECISION ------------------------------------------------ Run 3, Statement 5 : 278.89476 (avg : 279.72981) Run 3, Statement 6 : 8.06512 (avg : 8.07033) Run 3, Statement 7 : 7.81873 (avg : 7.80063) Run 3, Statement 8 : 1.5315 (avg : 1.54347) BINARY_DOUBLE ------------------------------------------------ Run 3, Statement 9 : 2.4963 (avg : 2.57184) Run 3, Statement 10: 1 (avg : 1.02943)

How to read these results?

- Statement 10 is the fastest one, unsurprisingly, as it aggregates pre-calculated
`LN(binary_double)`

values. The precalculation of the function means that all the work has been done already before the report, and the data type is the one we expected to perform best*in general* - Statements 4 and 8 are almost as fast (precalculated
`LN()`

values). Being only slight factors off, we can attribute the difference to the usual benchmark flaws, although it’s interesting to see that`DOUBLE PRECISION`

seems 1.5x slower to sum than`BINARY_DOUBLE`

and even`NUMBER`

- Statements 1, 5, 9 are the ones where no data type conversion is applied and
`SUM(LN(the_value))`

is being calculated. It is*staggering*how much slower both`NUMBER`

and`DOUBLE PRECISION`

are than`BINARY_DOUBLE`

. Statements 1 and 5 are a factor of 112x slower than statement 9! - Statements 2-3, 6-7 prove that converting the
`NUMBER`

or`DOUBLE PRECISION`

vales to`BINARY_DOUBLE`

*on the fly*provides a sufficiently performant workaround, which made statements 2-3, 6-7 only 3x slower than statement 9 - Statements 2-3, 6-7 show that casting and converting are about equivalent

Note, we found similar results with other functions, such as `EXP()`

### Analysis

The order of magnitude may seem surprising at first, but thinking about it, it is not. We would never do CPU intensive computation with `java.math.BigDecimal`

in Java. The `BigDecimal`

type is there for numeric accuracy, e.g. when it really matters what the monetary amount is, *exactly*. When we run analytics on monetary amounts, using `double`

is sufficient in Java as well.

If our data is `BigDecimal`

, and we cannot reasonably change that, it might still be better to use the `BigDecimal::doubleValue`

conversion prior to further processing using e.g. `Math::log`

. So, this translates directly to SQL, whose `LN()`

implementations are data type specific. IEEE 754 having been designed precisely for this purpose.

When doing CPU intensive computations both in Java, or in the database, we should always evaluate our various options of

- Quick fixing our data sets for the report (ad-hoc conversion prior to calculation)
- Thoroughly fixing our data sets in the schema (migration towards a better data type)
- Preprocessing our data sets (precalculating some very commonly used computations)

### Benchmark logic

Just run the below on your own hardware. I’m curious to see your results:

-- Copyright Data Geekery GmbH -- -- Licensed under the Apache License, Version 2.0 (the "License"); -- you may not use this file except in compliance with the License. -- You may obtain a copy of the License at -- -- http://www.apache.org/licenses/LICENSE-2.0 -- -- Unless required by applicable law or agreed to in writing, software -- distributed under the License is distributed on an "AS IS" BASIS, -- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -- See the License for the specific language governing permissions and -- limitations under the License. -- -- This version displays relative execution times (fastest execution = 1) SET SERVEROUTPUT ON ALTER SYSTEM FLUSH SHARED_POOL; ALTER SYSTEM FLUSH BUFFER_CACHE; CREATE TABLE data ( n1 NUMBER(20, 10), n2 NUMBER(20, 10), d1 DOUBLE PRECISION, d2 DOUBLE PRECISION, b1 BINARY_DOUBLE, b2 BINARY_DOUBLE ); INSERT INTO data SELECT level, ln(level), level, ln(level), level, ln(level) FROM dual CONNECT BY level <= 100000; CREATE TABLE results ( run NUMBER(2), stmt NUMBER(2), elapsed NUMBER ); DECLARE v_ts TIMESTAMP WITH TIME ZONE; v_repeat CONSTANT NUMBER := 10; v_stmt NUMBER; BEGIN -- Repeat the whole benchmark several times to avoid warmup penalty FOR r IN 1..5 LOOP v_ts := SYSTIMESTAMP; v_stmt := 1; FOR i IN 1..v_repeat LOOP FOR rec IN ( SELECT sum(ln(n1)) FROM data ) LOOP NULL; END LOOP; END LOOP; INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE); v_ts := SYSTIMESTAMP; v_stmt := v_stmt + 1; FOR i IN 1..v_repeat LOOP FOR rec IN ( SELECT sum(ln(cast(n1 as binary_double))) FROM data ) LOOP NULL; END LOOP; END LOOP; INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE); v_ts := SYSTIMESTAMP; v_stmt := v_stmt + 1; FOR i IN 1..v_repeat LOOP FOR rec IN ( SELECT sum(ln(to_binary_double(n1))) FROM data ) LOOP NULL; END LOOP; END LOOP; INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE); v_ts := SYSTIMESTAMP; v_stmt := v_stmt + 1; FOR i IN 1..v_repeat LOOP FOR rec IN ( SELECT sum(n2) FROM data ) LOOP NULL; END LOOP; END LOOP; INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE); v_ts := SYSTIMESTAMP; v_stmt := v_stmt + 1; FOR i IN 1..v_repeat LOOP FOR rec IN ( SELECT sum(ln(d1)) FROM data ) LOOP NULL; END LOOP; END LOOP; INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE); v_ts := SYSTIMESTAMP; v_stmt := v_stmt + 1; FOR i IN 1..v_repeat LOOP FOR rec IN ( SELECT sum(ln(cast(d1 as binary_double))) FROM data ) LOOP NULL; END LOOP; END LOOP; INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE); v_ts := SYSTIMESTAMP; v_stmt := v_stmt + 1; FOR i IN 1..v_repeat LOOP FOR rec IN ( SELECT sum(ln(to_binary_double(d1))) FROM data ) LOOP NULL; END LOOP; END LOOP; INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE); v_ts := SYSTIMESTAMP; v_stmt := v_stmt + 1; FOR i IN 1..v_repeat LOOP FOR rec IN ( SELECT sum(d2) FROM data ) LOOP NULL; END LOOP; END LOOP; INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE); v_ts := SYSTIMESTAMP; v_stmt := v_stmt + 1; FOR i IN 1..v_repeat LOOP FOR rec IN ( SELECT sum(ln(b1)) FROM data ) LOOP NULL; END LOOP; END LOOP; INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE); v_ts := SYSTIMESTAMP; v_stmt := v_stmt + 1; FOR i IN 1..v_repeat LOOP FOR rec IN ( SELECT sum(b2) FROM data ) LOOP NULL; END LOOP; END LOOP; INSERT INTO results VALUES (r, v_stmt, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE); END LOOP; FOR rec IN ( SELECT run, stmt, CAST(elapsed / MIN(elapsed) OVER() AS NUMBER(10, 5)) ratio, CAST(AVG(elapsed) OVER (PARTITION BY stmt) / MIN(elapsed) OVER() AS NUMBER(10, 5)) avg_ratio FROM results ORDER BY run, stmt ) LOOP dbms_output.put_line('Run ' || rec.run || ', Statement ' || rec.stmt || ' : ' || rec.ratio || ' (avg : ' || rec.avg_ratio || ')'); END LOOP; dbms_output.put_line(''); dbms_output.put_line('Copyright Data Geekery GmbH'); dbms_output.put_line('https://www.jooq.org/benchmark'); END; / DROP TABLE data; DROP TABLE results;

Source: jOOQ

## Leave a Reply