New node software for large-scale clients with PB-scale data onboarding to Filecoin network
Looking for standalone Deal Preparation? Try singularity-prepare
Looking for a complete end-to-end demonstration? Try Getting Started Guide
# Install nvm (https://github.com/nvm-sh/nvm#install--update-script)
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.1/install.sh | bash
source ~/.bashrc
# Install node v16
nvm install 16
npm i -g @techgreedy/singularity
singularity -h
git clone https://github.com/tech-greedy/singularity.git
cd singularity
npm ci
npm run build
npm link
singularity -h
By default, npm will pull the pre-built binaries for dependencies. You can choose to build it from source and override the one pulled by npm.
# Make sure you have go v1.17+ installed
git clone https://github.com/tech-greedy/go-generate-car.git
cd go-generate-car
make
Then copy the generated binary to override the existing one from the PATH for your node environment, i.e.
- singularity installed globally
/home/user/.nvm/versions/node/v16.xx.x/lib/node_modules/.bin
- singularity cloned locally
./node_modules/.bin
Note that the path may change depending on the nodejs version.
If you cannot find the folder above, try searching for the generate-car
binary first (i.e.m find ~/.nvm -name 'generate-car'
).
To use the tool as a daemon, it needs to initialize the config and the database. To do so, run
singularity init
By default, a repository will be initialized at $HOME_DIR/.singularity
.
Set the environment variable SINGULARITY_PATH
to override this behavior.
# Unix
export SINGULARITY_PATH=/the/path/to/the/repo
# Windows
set SINGULARITY_PATH=/the/path/to/the/repo
Since the tool is modularized, it can be deployed in different ways and have different components enabled or disabled.
Below are configurations for common scenarios.
This is useful if you only need deal preparation but not deal making.
You can still have deal making enabled, but disabling it will use slightly less system resources.
In default.toml from your repo
- change
index_service.enabled
to false - change
ipfs.enabled
to false - change
deal_tracking_service.enabled
to false - change
deal_replication_service.enabled
to false - change
deal_replication_worker.enabled
to false
This is useful if you know MongoDB, and you're hitting some bottlenecks or issues from the built-in MongoDb.
- Setup your own MongoDb instance
- In default.toml from your repo
- change
database.start_local
to false - change
connection.database
to the connection string of your own MongoDb database
- change
- On master server, set
deal_preparation_service.enabled
,database.start_local
to true and disable all other modules - On worker servers, set
deal_preparation_worker.enabled
to true and disable all other modules. Changeconnection.database
andconnection.deal_preparation_service
to the IP address of the master server
$ singularity
Usage: singularity [options] [command]
A tool for large-scale clients with PB-scale data onboarding to Filecoin network
Visit https://github.com/tech-greedy/singularity for more details
Options:
-V, --version output the version number
-h, --help display help for command
Commands:
init Initialize the configuration directory in SINGULARITY_PATH
If unset, it will be initialized at HOME_DIR/.singularity
daemon Start a daemon process for deal preparation and deal making
index Manage the dataset index which will help map the dataset path to actual piece
preparation|prep Manage deal preparation
help [command] display help for command
export SINGULARITY_PATH=/the/path/to/the/repo
singularity daemon
Deal preparation contains two parts
- Scanning Request - an initial effort to scan the directory and make plans of how to assign different files and folders to different chunks
- Generation Request - subsequent works to generate the car file and compute the commP
$ singularity prep -h
Usage: singularity preparation|prep [options] [command]
Manage deal preparation
Options:
-h, --help display help for command
Commands:
create [options] <datasetName> <datasetPath> <outDir> Start deal preparation for a local dataset
status [options] <dataset> Check the status of a deal preparation request
list [options] List all deal preparation requests
generation-manifest [options] <generationId> Get the Slingshot v3.x manifest data for a single deal generation request
generation-status [options] <generationId> Check the status of a single deal generation request
pause Pause scanning or generation requests
resume Resume scanning or generation requests
retry Retry scanning or generation requests
remove [options] <dataset> Remove all records from database for a dataset
help [command] display help for command
This will create a scanning request for a dataset. While the dataset is being scanned, it will also produce generation requests to be taken by workers.
$ singularity prep create -h
Usage: singularity preparation create [options] <datasetName> <datasetPath> <outDir>
Start deal preparation for a local dataset
Arguments:
datasetName A unique name of the dataset
datasetPath Directory path to the dataset
outDir The output Directory to save CAR files
Options:
-s, --deal-size <deal_size> Target deal size, i.e. 32GiB (default: "32 GiB")
-t, --tmp-dir <tmp_dir> Optional temporary directory. May be useful when it is at least 2x faster than the dataset source, such as when the dataset is on network mount, and the I/O is the bottleneck
-f, --skip-inaccessible-files Skip inaccessible files. Scanning may take longer to complete.
-m, --min-ratio <min_ratio> Min ratio of deal to sector size, i.e. 0.55
-M, --max-ratio <max_ratio> Max ratio of deal to sector size, i.e. 0.95
-h, --help display help for command
The deal preparation supports public S3 bucket natively. Temporary directory is mandatory when using with S3 bucket. i.e.
singularity prep create -t <tmp_dir> <dataset_name> s3://<bucket_name>/<optional_prefix>/ <out_dir>
For each dataset preparation request, it always starts with scanning request, once enough files can be packed into a single deal, it will create a generation request. In other words, each preparation request is a single scanning request and a bunch of generation requests.
You can pause/resume/retry the scanning request or generation requests.
singularity prep pause -h
singularity prep resume -h
singularity prep retry -h
The whole data preparation requests can be removed from database. All generated CAR files can also be deleted by
specifying --purge
option.
singularity prep remove -h
List all the deal preparation requests, including whether scanning has completed and how many generation requests have completed or hit errors for each of them.
singularity prep list
Check status for a specific deal preparation request, including the status of the initial scanning request and all corresponding generation requests.
singularity prep status -h
Look into a specific generation request, including what are the files or folders included in that request and their corresponding size, cid, selector, etc.
singularity prep generation-status -h
singularity prep generation-manifest -h
WEB3_STORAGE_TOKEN="eyJ..." singularity prep upload-manifest -h
singularity monitor
Deal replication module supports both lotus-market and boost based storage providers (later on we might deprecate lotus-market support). Currently it is required to have both lotus and boost cli binary in order for this module to work.
Look for default.toml
in the initialized repo, verify in the [deal_replication_worker] section, both binary can be
accessed.
If you need to specify environment variable like FULLNODE_API_INFO, it can also be specified there.
In order to make deals, we recommend setting up a lite node to use with the tool.
Once you have the lite node setup, you can import your wallet key for the verified client address.
If your target SP runs on Boost, boost executable is also needed to be able to make deal.
Once you have the boost cli initialized, you can import your wallet key for the verified client address.
$ singularity repl start -h
Usage: singularity replication start [options] <datasetid> <storage-providers> <client> [# of replica]
Start deal replication for a prepared local dataset
Arguments:
datasetid Existing ID of dataset prepared.
storage-providers Comma separated storage provider list
client Client address where deals are proposed from
# of replica Number of targeting replica of the dataset (default: 10)
Options:
-u, --url-prefix <urlprefix> URL prefix for car downloading. Must be reachable by provider's boostd node. (default: "http://127.0.0.1/")
-p, --price <maxprice> Maximum price per epoch per GiB in Fil. (default: "0")
-r, --verified <verified> Whether to propose deal as verified. true|false. (default: "true")
-s, --start-delay <startdelay> Deal start delay in days. (StartEpoch) (default: "7")
-d, --duration <duration> Duration in days for deal length. (default: "525")
-o, --offline <offline> Propose as offline deal. (default: "true")
-m, --max-deals <maxdeals> Max number of deals in this replication request per SP, per cron triggered. (default: "0")
-c, --cron-schedule <cronschedule> Optional cron to send deals at interval. Use double quote to wrap the format containing spaces.
-x, --cron-max-deals <cronmaxdeals> When cron schedule specified, limit the total number of deals across entire cron, per SP.
-xp, --cron-max-pending-deals <cronmaxpendingdeals> When cron schedule specified, limit the total number of pending deals determined by dealtracking service, per SP.
-l, --file-list-path <filelistpath> Absolute path to a txt file that will limit to replicate only from the list. Must be visible by deal replication worker.
-n, --notes <notes> Any notes or tag want to store along the replication request, for tracking purpose.
-h, --help display help for command
A simple example to send all car files in one prepared dataset "CommonCrawl" to one storage provider f01234 immediately:
singularity repl start CommonCrawl f01234 f15djc5avdxihgu234231rfrrzbvnnqvzurxe55kja
A more complex example, send 10 deals to storage provider f01234 and f05678, every hour on the 1st minute from prepared dataset "CommonCrawl", until all CAR files are dealt.
singularity repl start -m 10 -c "1 * * * *" CommonCrawl f01234,f05678 f15djc5avdxihgu234231rfrrzbvnnqvzurxe55kja
Look for default.toml
in the initialized repo.
This sets the MongoDb connection string. The default value corresponds to the built-in MongoDb server shipped with this software. If you choose to use a standalone MongoDb service, set the connection string here.
Sets the API endpoint of deal preparation service.
The software is shipping with a built-in MongoDb server. For small to medium-sized dataset, this should be sufficient.
For users who're onboarding large scale datasets, we recommend running your own MongoDb service which fits into your
infrastructure by setting this value to false
.
To connect to a standalone MongoDb service, set the value of connection string here.
Not that the MongoDB server may consume as much as 80% of usable memory.
The path of the database files the built-in MongoDb will be using, as well as the IP and port to bind the service to.
Service to manage preparation requests
Whether to enable the service and which IP and port to bind the service to
If the service crashes or is interrupted, there may be incomplete CAR files generated. Enabling this can clean them up.
The default min/max ratio of CAR file size divided by the target deal size. The dataset splitting is performed with below logic
- Perform a Glob pattern match and get all files in sorted order
- Iterate through all the files and keep accumulating file sizes into a chunk
- Once the size of a chunk is between min and max ratio, pack this chunk to a CAR file and start with a new chunk
- If the size of the file is too large to fit into a chunk, split the file to hit the min ration
Worker to scan the dataset, make plan and generate Car file and CIDs
Whether to enable the worker and how many worker instances. As a rule of thumb, use min(cpu_cores / 2, io_MBps / 20)
Each generation worker consumes negligible RAM, 20-50 MiB/s disk I/O and 100-250% of CPU.
Each 32GiB deal takes ~10 minutes to be generated on AMD EPYC CPU with NVME drive.
- When dealing with lots of small files, CPU usage increases while generation speed decreases. Meanwhile, IO may become the bottleneck if not using SSD.
- When using S3 bucket public as the dataset, the Internet Speed may become the bottleneck
The repo ~/.singularity
or the folder specified by SINGULARITY_PATH
contains all state of the service.
To backup, simply backup the repo folder.
Starting version 2.0.0, anonymous data including error messages, data preparation and deal making statistics
will be collected for us to better understand how the software is used and improve the software. To disable behavior,
create and set metrics.enabled
to false
in default.toml
.
Use --skip-inaccessible-files
when creating the data preparation request singularity prep create
.
For existing generation requests, use singularity prep retry gen --skip-inaccessible-files
,
however this currently only works when the tmpDir is used.
Only Deal Preparation works and Indexing works on Windows. Deal Replication and Retrieval only works in Linux/Mac due to dependency restrictions.
In case that one CAR contains more files than allowed by OS, you will need to increase the open file limit with ulimit
, or LimitNOFILE
if using systemd.
Depending on the version, NodeJS by default has a max heap memory of 2GB. To increase this limit, i.e. to increase to
4G, set environment variable
NODE_OPTIONS="--max-old-space-size=4096"
.
If you are using network mount such as NFS or Goofys, a temporary network issue may cause the CAR file generation to
fail.
If the error rate is less than 10%, you may assume they are transient and can be fixed by performing
a retry.
If the error is consistent, you will need to dig into the root cause of what have gone wrong. It could be incorrectly
configured permission or DNS resolver, etc. You can find more details in /var/log/syslog
.
Avoid using root, or try the fix below
chown -R $(whoami) ~/
npm config set unsafe-perm true
npm config set user 0
Something wrong while starting MongoDB. Check what has gone wrong
MONGOMS_DEBUG=1 singularity daemon
If the error shows libcrypto.so.1.1
cannot be found. Try this solution.
Create a bug report or request a feature.