Skip to main content
Skip to main content
Edit this page

TPC-H (1999)

A popular benchmark which models the internal data warehouse of a wholesale supplier. The data is stored into a 3rd normal form representation, requiring lots of joins at query runtime. Despite its age and its unrealistic assumption that the data is uniformly and independently distributed, TPC-H remains the most popular OLAP benchmark to date.

References

Data Generation and Import

First, checkout the TPC-H repository and compile the data generator:

Then, generate the data. Parameter -s specifies the scale factor. For example, with -s 100, 600 million rows are generated for table 'lineitem'.

Detailed table sizes with scale factor 100:

Tablesize (in rows)size (compressed in ClickHouse)
nation252 kB
region51 kB
part20.000.000895 MB
supplier1.000.00075 MB
partsupp80.000.0004.37 GB
customer15.000.0001.19 GB
orders150.000.0006.15 GB
lineitem600.00.0026.69 GB

(Compressed sizes in ClickHouse are taken from system.tables.total_bytes and based on below table definitions.)

Now create tables in ClickHouse.

We stick as closely as possible to the rules of the TPC-H specification:

  • Primary keys are created only for the columns mentioned in section 1.4.2.2 of the specification.
  • Substitution parameters were replaced by the values for query validation in sections 2.1.x.4 of the specification.
  • As per section 1.4.2.1, the table definitions do not use the optional NOT NULL constraints, even if dbgen generates them by default. The performance of SELECT queries in ClickHouse is not affected by the presence or absence of NOT NULL constraints.
  • As per section 1.3.1, we use ClickHouse's native datatypes (e.g. Int32, String) to implement the abstract datatypes mentioned in the specification (e.g. Identifier, Variable text, size N). The only effect of this is better readability, the SQL-92 datatypes generated by dbgen (e.g. INTEGER, VARCHAR(40)) would also work in ClickHouse.

The data can be imported as follows:

Note

Instead of using tpch-kit and generating the tables by yourself, you can alternatively import the data from a public S3 bucket. Make sure to create empty tables first using above CREATE statements.

Queries

Note

Setting join_use_nulls should be enabled to produce correct results according to SQL standard.

The queries are generated by ./qgen -s <scaling_factor>. Example queries for s = 100:

Correctness

The result of the queries agrees with the official results unless mentioned otherwise. To verify, generate a TPC-H database with scale factor = 1 (dbgen, see above) and compare with the expected results in tpch-kit.

Q1

Q2

Note

As of October 2024, the query does not work out-of-the box due to correlated subqueries. Corresponding issue: https://github.com/ClickHouse/ClickHouse/issues/6697

This alternative formulation works and was verified to return the reference results.

Q3

Q4

Note

As of October 2024, the query does not work out-of-the box due to correlated subqueries. Corresponding issue: https://github.com/ClickHouse/ClickHouse/issues/6697

This alternative formulation works and was verified to return the reference results.

Q5

Q6

Note

As of October 2024, the query does not work out-of-the box due to a bug with Decimal addition. Corresponding issue: https://github.com/ClickHouse/ClickHouse/issues/70136

This alternative formulation works and was verified to return the reference results.

Q7

Q8

Q9

Q10

Q11

Q12

Q13

Note

This alternative formulation works and was verified to return the reference results.

Q14

Q15

Q16

Q17

Note

As of October 2024, the query does not work out-of-the box due to correlated subqueries. Corresponding issue: https://github.com/ClickHouse/ClickHouse/issues/6697

This alternative formulation works and was verified to return the reference results.

Q18

Q19

Q20

Note

As of October 2024, the query does not work out-of-the box due to correlated subqueries. Corresponding issue: https://github.com/ClickHouse/ClickHouse/issues/6697

Q21

Note

As of October 2024, the query does not work out-of-the box due to correlated subqueries. Corresponding issue: https://github.com/ClickHouse/ClickHouse/issues/6697

Q22