Hello. I'm trying to train flowpp++ on my own data (32x32x3 images, as in CIFAR10), but the process aborts every 1-2 epochs with the following error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
val_bpd=4.84827 val_inverr=2.76096 num_val_examples=0001912
iter=0011700 epoch=1.09920 bpd=4.90475 gnorm=7888.22754 lr=0.00030 fps=19.27829 sps=2.40979
iter=0011800 epoch=1.20253 bpd=4.90307 gnorm=8177.17480 lr=0.00030 fps=19.20983 sps=2.40123
Traceback (most recent call last):
  File "/home/user/ML/VENV_HOROVOD/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/user/ML/VENV_HOROVOD/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/user/ML/VENV_HOROVOD/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Input is not invertible.
     [[{{node gradients/Pointwise_12_1/MatrixDeterminant_grad/MatrixInverse}}]]
     [[global_norm/global_norm/_43351]]
  (1) Invalid argument: Input is not invertible.
     [[{{node gradients/Pointwise_12_1/MatrixDeterminant_grad/MatrixInverse}}]]
0 successful operations.
0 derived errors ignored.
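My reading of the failing node name (MatrixDeterminant_grad/MatrixInverse) is that the error is raised while backpropagating through a matrix determinant, not in the forward pass. As far as I understand, that gradient is computed as det(A) * inverse(A)^T, so it needs the inverse of the matrix and fails as soon as the matrix is singular. The sketch below only illustrates that idea in plain Python on 2x2 matrices. The helper names det2, inverse2 and det_grad_via_inverse are hypothetical and are not taken from TensorFlow or from this repository.

```python
# Illustrative only: a plain-Python 2x2 version of the determinant gradient
#   d det(A) / dA = det(A) * inverse(A)^T
# All function names here are hypothetical, not TensorFlow or repository code.

def det2(a):
    """Determinant of a 2x2 matrix given as nested lists."""
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]


def inverse2(a):
    """Inverse of a 2x2 matrix; raises ValueError if the matrix is singular."""
    d = det2(a)
    if d == 0:
        raise ValueError("Input is not invertible.")
    return [[a[1][1] / d, -a[0][1] / d],
            [-a[1][0] / d, a[0][0] / d]]


def det_grad_via_inverse(a):
    """Gradient of det(A) with respect to A, computed through the inverse."""
    d = det2(a)
    inv = inverse2(a)  # this is the step that fails for a singular matrix
    # Entry [i][j] of the result is det(A) * inverse(A)[j][i] (the transpose).
    return [[d * inv[j][i] for j in range(2)] for i in range(2)]


if __name__ == "__main__":
    # Invertible matrix: the gradient is well defined.
    print(det_grad_via_inverse([[2.0, 1.0], [1.0, 3.0]]))

    # Singular matrix (second row is twice the first): the inverse fails.
    try:
        det_grad_via_inverse([[1.0, 2.0], [2.0, 4.0]])
    except ValueError as err:
        print("gradient failed:", err)
```

If this reading is right, the abort would mean that one of the matrices inside the Pointwise layer became singular, or numerically close to it, during training. That is an assumption on my side and I have not verified it against the model code.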
Also, the generated samples still do not look like the training data, even after 10k iterations:
sample images:
Do you know the cause of the problem? Is there any way to fix this error?
P.S. I have only one GPU, so I launch the program with mpiexec -n 1 python run_cifar.py --checkpoint=... (training parameters: init_bs=16, total_bs=8). Thanks.