Error in a training process #3

Open
Hramchenko opened this issue Jan 15, 2020 · 0 comments

Hramchenko commented Jan 15, 2020

Hello. I'm trying to train flowpp++ on my own data (32x32x3 images, as in CIFAR10), but the process aborts every 1-2 epochs with the following error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
val_bpd=4.84827 val_inverr=2.76096 num_val_examples=0001912
iter=0011700 epoch=1.09920 bpd=4.90475 gnorm=7888.22754 lr=0.00030 fps=19.27829 sps=2.40979
iter=0011800 epoch=1.20253 bpd=4.90307 gnorm=8177.17480 lr=0.00030 fps=19.20983 sps=2.40123
Traceback (most recent call last):
  File "/home/user/ML/VENV_HOROVOD/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/user/ML/VENV_HOROVOD/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/user/ML/VENV_HOROVOD/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Input is not invertible.
         [[{{node gradients/Pointwise_12_1/MatrixDeterminant_grad/MatrixInverse}}]]
         [[global_norm/global_norm/_43351]]
  (1) Invalid argument: Input is not invertible.
         [[{{node gradients/Pointwise_12_1/MatrixDeterminant_grad/MatrixInverse}}]]
0 successful operations.
0 derived errors ignored.

and the results do not look like the training samples even after 10k iterations:

[Image: individualImage]

sample images:

[Image: Screenshot_20200115_122725_S]

Do you know the cause of the problem? Is there any way to fix this error?
P.S. I have only one GPU, so I start the program with mpiexec -n 1 python run_cifar.py --checkpoint=.... Training parameters: init_bs=16, total_bs=8. Thanks.
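
A note on the failing node: the name gradients/Pointwise_12_1/MatrixDeterminant_grad/MatrixInverse suggests that the backward pass of a matrix determinant is being computed, and that gradient needs the inverse of the same matrix. The sketch below reproduces that failure mode in plain NumPy rather than TensorFlow. It assumes (this is not confirmed anywhere in the report) that the Pointwise layer differentiates the determinant of a square weight matrix; the helper names det_grad and is_invertible are hypothetical and are not part of flowpp.

```python
import numpy as np


def det_grad(w):
    """Gradient of det(w) with respect to w, via Jacobi's formula.

    d det(w) / d w = det(w) * inv(w).T, so the inverse must exist.
    """
    return np.linalg.det(w) * np.linalg.inv(w).T


def is_invertible(w):
    """Return True if the determinant gradient can be computed for w."""
    try:
        det_grad(w)
        return True
    except np.linalg.LinAlgError:
        # NumPy raises this for an exactly singular matrix, analogous to
        # the "Input is not invertible" error in the traceback above.
        return False


# Hypothetical 2x2 weights, chosen only for illustration.
well_conditioned = np.array([[2.0, 0.0], [0.0, 3.0]])
singular = np.array([[1.0, 2.0], [2.0, 4.0]])  # second row = 2 * first row

print(is_invertible(well_conditioned))
print(is_invertible(singular))
```

If the assumption holds, the error would mean that a weight matrix became singular during training. The log above shows gnorm values around 8000, so large updates might be one way to reach such a state, but that is a guess and not a diagnosis.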
