D2F-Bench (Data File Format Benchmark)

Prerequisites

Apache Hadoop 2.2 or later / Cloudera CDH / HortonWorks HDP
Apache Hive version 13 or later
Between 15 minutes and 2 days to generate data (depending on the Scale Factor you choose and available hardware).
Supported engines: Hive, Spark SQL, Impala, Pig

Install & Setup TPC-H

1.Create a copy of the project locally:

git clone https://github.com/t-ivanov/D2F-Bench

2.Download, compile and package the appropriate TPC-H data generator by executing:

./tpch-build.sh

3.Adjust the benchmark settings according to your needs.

Edit the ~/D2F-Bench/tpch/conf/settings.sh file:

TABLES="part partsupp supplier customer orders lineitem nation region" -- defines which tables to use for the tests. This should not be changed unless you are testing any new functionality.
SCALE=2 -- defines the scale factor for the data generation phase. (2=2GB, 3=3GB ... 1000=1TB). You need to decide on a "Scale Factor" which represents how much data you will generate. Scale Factor roughly translates to gigabytes, so a Scale Factor of 100 is about 100 gigabytes and one terabyte is Scale Factor 1000. Decide how much data you want and keep it in mind for the next step. If you have a cluster of 4-10 nodes or just want to experiment at a smaller scale, scale 1000 (1 TB) of data is a good starting point. If you have a large cluster, you may want to choose Scale 10000 (10 TB) or more.
ENGINE=hive -- set the execution engine with which you want to run the benchmark. Supported are hive, pig, spark-sql, impala.
FILE_FORMAT=orc -- defines the data file format for storing the table data. Supported are ORC, RCfile, Parquet, SequenceFile, TextFile.
TEST_QUERIES="1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22" -- defines the list of queries to execute sequencially in one data test. Between each execution is performed cache cleanup on underlying operating system.

There are additional parameters which are engine specific and are curently in test-mode.

4.Generate and load the data by executing (under root):

sudo ./tpch-setup.sh

5.Execute the queries specified in the TEST_QUERIES list:

./tpch-exec-queries.sh

6.Results from the query runs are stored in folder ~/D2F-Bench/tpch/results/[Current-date]/[database_name_query_number].txt. Additionally, in the file ~/D2F-Bench/tpch/logs/query_times.csv are stored all query times from the latest executions.

ACKNOWLEDGMENTS

Hive-testbench - https://github.com/hortonworks/hive-testbench

BigBench - https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench

HiBench Suite - https://github.com/intel-hadoop/HiBench

TPC-H for Pig - https://github.com/JerryLead/BenchmarkScripts & https://github.com/ssavvides/tpch-pig

TPC-H for Impala & Hive - https://github.com/kj-ki/tpc-h-impala

Name		Name	Last commit message	Last commit date
Latest commit History 47 Commits
bin		bin
tpch		tpch
Documentation.Project.Drill.pdf		Documentation.Project.Drill.pdf
LICENSE		LICENSE
Presto_Doc.pdf		Presto_Doc.pdf
README.md		README.md
docuTajo.pdf		docuTajo.pdf
engines.md		engines.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

D2F-Bench (Data File Format Benchmark)

Prerequisites

Install & Setup TPC-H

ACKNOWLEDGMENTS

About

Releases

Packages

Languages

License

t-ivanov/D2F-Bench

Folders and files

Latest commit

History

Repository files navigation

D2F-Bench (Data File Format Benchmark)

Prerequisites

Install & Setup TPC-H

ACKNOWLEDGMENTS

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages