A Data Processing System for Flexible Employment With Different Deployment Characteristics.
Details about Nemo and its development can be found in:
- Our website: https://nemo.apache.org/
- Our project wiki: https://cwiki.apache.org/confluence/display/NEMO/
- Our Dev mailing list for contributing: [email protected] (subscribe)
Please refer to the Contribution guideline to contribute to our project.
- Java 8
- Maven
- YARN settings
- Download Hadoop 2.7.2 at https://archive.apache.org/dist/hadoop/common/hadoop-2.7.2/
- Set the shell profile as following:
export HADOOP_HOME=/path/to/hadoop-2.7.2 export YARN_HOME=$HADOOP_HOME export PATH=$PATH:$HADOOP_HOME/bin
- Protobuf 2.5.0
-
On Ubuntu 14.04 LTS and its point releases:
sudo apt-get install protobuf-compiler
-
On Ubuntu 16.04 LTS and its point releases:
sudo add-apt-repository ppa:snuspl/protobuf-250 sudo apt update sudo apt install protobuf-compiler=2.5.0-9xenial1
-
On macOS:
brew tap homebrew/versions brew install [email protected]
-
Or build from source:
- Downloadable at https://github.com/google/protobuf/releases/tag/v2.5.0
- Extract the downloaded tarball
./configure
make
make check
sudo make install
-
To check for a successful installation of version 2.5.0, run
protoc --version
-
- Run all tests and install:
mvn clean install -T 2C
- Run only unit tests and install:
mvn clean install -DskipITs -T 2C
Apache Nemo is an official runner of Apache Beam, and it can be executed from Beam, using NemoRunner, as well as directly from the Nemo project. The details of using NemoRunner from Beam is shown on the NemoRunner page of the Apache Beam website. Below describes how Beam applications can be run directly on Nemo.
-job_id
: ID of the Beam job-user_main
: Canonical name of the Beam application-user_args
: Arguments that the Beam application accepts-optimization_policy
: Canonical name of the optimization policy to apply to a job DAG in Nemo Compiler-deploy_mode
:yarn
is supported(default value islocal
)
## MapReduce example
./bin/run_beam.sh \
-job_id mr_default \
-executor_json `pwd`/examples/resources/executors/beam_test_executor_resources.json \
-optimization_policy org.apache.nemo.compiler.optimizer.policy.DefaultPolicy \
-user_main org.apache.nemo.examples.beam.WordCount \
-user_args "`pwd`/examples/resources/inputs/test_input_wordcount `pwd`/outputs/wordcount"
## YARN cluster example
./bin/run_beam.sh \
-deploy_mode yarn \
-job_id mr_transient \
-executor_json `pwd`/examples/resources/executors/beam_test_executor_resources.json \
-user_main org.apache.nemo.examples.beam.WordCount \
-optimization_policy org.apache.nemo.compiler.optimizer.policy.TransientResourcePolicy \
-user_args "hdfs://v-m:9000/test_input_wordcount hdfs://v-m:9000/test_output_wordcount"
## NEXMark streaming Q0 (query0) example
./bin/run_nexmark.sh \
-job_id nexmark-Q0 \
-executor_json `pwd`/examples/resources/executors/beam_test_executor_resources.json \
-user_main org.apache.beam.sdk.nexmark.Main \
-optimization_policy org.apache.nemo.compiler.optimizer.policy.StreamingPolicy \
-scheduler_impl_class_name org.apache.nemo.runtime.master.scheduler.StreamingScheduler \
-user_args "--runner=org.apache.nemo.client.beam.NemoRunner --streaming=true --query=0 --numEventGenerators=1"
-executor_json
command line option can be used to provide a path to the JSON file that describes resource configuration for executors. Its default value is config/default.json
, which initializes one of each Transient
, Reserved
, and Compute
executor, each of which has one core and 1024MB memory.
num
(optional): Number of containers. Default value is 1type
: Three container types are supported:Transient
: Containers that store eviction-prone resources. When batch jobs use idle resources inTransient
containers, they can be arbitrarily evicted when latency-critical jobs attempt to use the resources.Reserved
: Containers that store eviction-free resources.Reserved
containers are used to reliably store intermediate data which have high eviction cost.Compute
: Containers that are mainly used for computation.
memory_mb
: Memory size in MBcapacity
: Number ofTask
s that can be run in an executor. Set this value to be the same as the number of CPU cores of the container.
[
{
"num": 12,
"type": "Transient",
"memory_mb": 1024,
"capacity": 4
},
{
"type": "Reserved",
"memory_mb": 1024,
"capacity": 2
}
]
This example configuration specifies
- 12 transient containers with 4 cores and 1024MB memory each
- 1 reserved container with 2 cores and 1024MB memory
Nemo Compiler and Engine can store JSON representation of intermediate DAGs.
-dag_dir
command line option is used to specify the directory where the JSON files are stored. The default directory is./dag
. Using our online visualizer, you can easily visualize a DAG. Just drop the JSON file of the DAG as an input to it.
./bin/run_beam.sh \
-job_id als \
-executor_json `pwd`/examples/resources/executors/beam_test_executor_resources.json \
-user_main org.apache.nemo.examples.beam.AlternatingLeastSquare \
-optimization_policy org.apache.nemo.compiler.optimizer.policy.TransientResourcePolicy \
-dag_dir "./dag/als" \
-user_args "`pwd`/examples/resources/inputs/test_input_als 10 3"
- To exclude Spark related packages: mvn clean install -T 2C -DskipTests -pl \!compiler/frontend/spark,\!examples/spark
- To exclude Beam related packages: mvn clean install -T 2C -DskipTests -pl \!compiler/frontend/beam,\!examples/beam
- To exclude NEXMark related packages: mvn clean install -T 2C -DskipTests -pl \!examples/nexmark