Does training my model mean fine-tuning or retraining a model? #10

Open
wjy979769265 opened this issue Jun 14, 2019 · 4 comments

Comments

@wjy979769265

Sorry, I have a question about how to fine-tune your pre-trained PrettyBig model, because I want to generate some text related to my dataset.
Thank you very much :)

@wjy979769265
Author

wjy979769265 commented Jun 14, 2019

I'm also trying to train a model by following your guide. I created a new folder to save my model checkpoints and converted my dataset to a TFRecords file. I also created a JSON file like this:

{
  "n_head": 16,
  "encoder_path": "encoder",
  "n_vocab": 50257,
  "embed_dropout": 0.1,
  "lr": 0.00025,
  "warmup_steps": 2000,
  "weight_decay": 0.01,
  "beta1": 0.9,
  "beta2": 0.98,
  "epsilon": 1e-9,
  "opt_name": "adam",
  "train_batch_size": 256,
  "attn_dropout": 0.1,
  "train_steps": 100,
  "eval_steps": 10,
  "max_steps": 500000,
  "data_path": "datasets/openwebtext/",
  "scale": 0.20412414523193154,
  "res_dropout": 0.1,
  "predict_batch_size": 8,
  "eval_batch_size": 8,
  "iterations": 500,
  "n_embd": 1024,
  "input": "openwebtext",
  "model": "GPT2",
  "model_path": "mymodel",
  "n_ctx": 1024,
  "predict_path": "mymodelprediction.txt",
  "n_layer": 24
}
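
A quick way to sanity-check a config like this before launching a run is to load it and test a few relationships between its fields. The sketch below inlines only a subset of the keys from the JSON above. The checks themselves (head size divides evenly, `scale` matching 1/sqrt(n_layer)) are illustrative assumptions about what is worth verifying, not rules stated by the repo.

```python
import json
import math

# Subset of the posted config, inlined so the example is self-contained.
config_text = """
{"n_head": 16, "n_embd": 1024, "n_layer": 24, "n_ctx": 1024,
 "scale": 0.20412414523193154, "train_batch_size": 256}
"""
params = json.loads(config_text)

# Each attention head gets n_embd / n_head channels; this should divide evenly.
head_size, remainder = divmod(params["n_embd"], params["n_head"])

# Observation only: the posted scale is numerically equal to 1 / sqrt(n_layer).
scale_matches_depth = math.isclose(params["scale"], 1 / math.sqrt(params["n_layer"]))

print(head_size, remainder)  # 64 0
print(scale_matches_depth)   # True
```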

I put my TFRecord data in datasets/openwebtext/ and changed the path in inputs.py:
files = [os.path.join(params["data_path"], "/movie.tfrecords")]
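
A side note on the `inputs.py` line above, offered as something to check rather than a confirmed cause of the error in this thread: in Python, `os.path.join` discards every earlier component when a later one starts with a path separator, so the leading slash in `"/movie.tfrecords"` makes the result ignore `data_path`. A minimal sketch using `posixpath` (the POSIX flavour of `os.path`, chosen here so the result is the same on any OS):

```python
import posixpath  # on Linux (including Colab), os.path is this module

params = {"data_path": "datasets/openwebtext/"}  # value from the posted JSON config

# Leading slash: the second argument is treated as absolute, so data_path is dropped.
with_slash = posixpath.join(params["data_path"], "/movie.tfrecords")

# No leading slash: the file name is appended to data_path as intended.
without_slash = posixpath.join(params["data_path"], "movie.tfrecords")

print(with_slash)     # /movie.tfrecords
print(without_slash)  # datasets/openwebtext/movie.tfrecords
```

If the file really lives in `datasets/openwebtext/`, dropping the leading slash from the file name is the form that points there.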

When I started training the model, I got an error saying that the optimizer failed:
Done calling model_fn.
I0614 15:52:23.765870 140478902515584 estimator.py:1147] Done calling model_fn.
Create CheckpointSaverHook.
I0614 15:52:23.768234 140478902515584 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
Graph was finalized.
I0614 15:52:29.201557 140478902515584 monitored_session.py:240] Graph was finalized.
2019-06-14 15:52:29.207255: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2200000000 Hz
2019-06-14 15:52:29.207564: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2ecd6c0 executing computations on platform Host. Devices:
2019-06-14 15:52:29.207636: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
2019-06-14 15:52:43.704444: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
Running local_init_op.
I0614 15:52:57.521352 140478902515584 session_manager.py:500] Running local_init_op.
Done running local_init_op.
I0614 15:52:57.879982 140478902515584 session_manager.py:502] Done running local_init_op.
Saving checkpoints for 0 into mymodel/model.ckpt.
I0614 15:53:11.620566 140478902515584 basic_session_run_hooks.py:606] Saving checkpoints for 0 into mymodel/model.ckpt.
2019-06-14 15:55:08.456842: W tensorflow/core/framework/allocator.cc:107] Allocation of 4294967296 exceeds 10% of system memory.
2019-06-14 15:55:12.012810: W tensorflow/core/framework/allocator.cc:107] Allocation of 4294967296 exceeds 10% of system memory.
tcmalloc: large alloc 4294967296 bytes == 0x19407a000 @ 0x7fc3cad00b6b 0x7fc3cad20379 0x7fc3b5d3e437 0x7fc3b5cee4bf 0x7fc3b59f08d9 0x7fc3b59f86ff 0x7fc3bccbbbe2 0x7fc3bccbda3e 0x7fc3bccbdc37 0x7fc3bccb6375 0x7fc3bcc5a2e1 0x7fc3bcc5b495 0x7fc3bcb57cec 0x7fc3bcb58ec5 0x7fc3bcb5a880 0x7fc3bcb5d18f 0x7fc3bcb4ec89 0x7fc3bcb50d1b 0x7fc3ba6f573a 0x7fc3ba6f6dc4 0x7fc3ba6f8f41 0x7fc3ba6fa768 0x7fc3b86525fd 0x7fc3ba731e2d 0x7fc3ba732b0c 0x7fc3b864f634 0x7fc3b864f6f2 0x7fc3b860679e 0x502d6f 0x506859 0x502209
2019-06-14 15:55:13.181140: W tensorflow/core/framework/allocator.cc:107] Allocation of 4294967296 exceeds 10% of system memory.
tcmalloc: large alloc 4294967296 bytes == 0x19407a000 @ 0x7fc3cad00b6b 0x7fc3cad20379 0x7fc3b5d3e437 0x7fc3b5cee4bf 0x7fc3b59f08d9 0x7fc3b59f86ff 0x7fc3bccbbbe2 0x7fc3bccbda3e 0x7fc3bccbdc37 0x7fc3bccb6375 0x7fc3bcc5a2e1 0x7fc3bcc5b495 0x7fc3bcb57cec 0x7fc3bcb58ec5 0x7fc3bcb5a880 0x7fc3bcb5d18f 0x7fc3bcb4ec89 0x7fc3bcb50d1b 0x7fc3ba6f573a 0x7fc3ba6f6dc4 0x7fc3ba6f8f41 0x7fc3ba6fa768 0x7fc3b86525fd 0x7fc3ba731e2d 0x7fc3ba732b0c 0x7fc3b864f634 0x7fc3b864f6f2 0x7fc3b860679e 0x502d6f 0x506859 0x502209
2019-06-14 15:55:14.330157: W tensorflow/core/framework/allocator.cc:107] Allocation of 4294967296 exceeds 10% of system memory.
tcmalloc: large alloc 4294967296 bytes == 0x19407a000 @ 0x7fc3cad00b6b 0x7fc3cad20379 0x7fc3b5d3e437 0x7fc3b5cee4bf 0x7fc3b59f08d9 0x7fc3b59f86ff 0x7fc3bccbbbe2 0x7fc3bccbda3e 0x7fc3bccbdc37 0x7fc3bccb6375 0x7fc3bcc5a2e1 0x7fc3bcc5b495 0x7fc3bcb57cec 0x7fc3bcb58ec5 0x7fc3bcb5a880 0x7fc3bcb5d18f 0x7fc3bcb4ec89 0x7fc3bcb50d1b 0x7fc3ba6f573a 0x7fc3ba6f6dc4 0x7fc3ba6f8f41 0x7fc3ba6fa768 0x7fc3b86525fd 0x7fc3ba731e2d 0x7fc3ba732b0c 0x7fc3b864f634 0x7fc3b864f6f2 0x7fc3b860679e 0x502d6f 0x506859 0x502209
2019-06-14 15:55:15.507901: W tensorflow/core/framework/allocator.cc:107] Allocation of 4294967296 exceeds 10% of system memory.
....
2019-06-14 16:00:04.472067: W tensorflow/core/grappler/optimizers/meta_optimizer.cc:499] arithmetic_optimizer failed: Deadline exceeded: arithmetic_optimizer exceeded deadline., time = 368.891ms.
^C

If anyone has an idea, please help me.

@wjy979769265
Author

Running local_init_op.
I0615 06:44:22.530402 140075623929728 session_manager.py:500] Running local_init_op.
Done running local_init_op.
I0615 06:44:22.885511 140075623929728 session_manager.py:502] Done running local_init_op.
Saving checkpoints for 0 into lol/model.ckpt.
I0615 06:44:36.133187 140075623929728 basic_session_run_hooks.py:606] Saving checkpoints for 0 into lol/model.ckpt.
2019-06-15 06:46:31.778847: W tensorflow/core/framework/allocator.cc:107] Allocation of 4294967296 exceeds 10% of system memory.
tcmalloc: large alloc 4294967296 bytes == 0x193eae000 @ 0x7f65e5892b6b 0x7f65e58b2379 0x7f65d08d0437 0x7f65d08804bf 0x7f65d05828d9 0x7f65d058a6ff 0x7f65d784dbe2 0x7f65d784fa3e 0x7f65d784fc37 0x7f65d7848375 0x7f65d77ec2e1 0x7f65d77ed495 0x7f65d76e9cec 0x7f65d76eaec5 0x7f65d76ec880 0x7f65d76ef18f 0x7f65d76e0c89 0x7f65d76e2d1b 0x7f65d528773a 0x7f65d5288dc4 0x7f65d528af41 0x7f65d528c768 0x7f65d31e45fd 0x7f65d52c3e2d 0x7f65d52c4b0c 0x7f65d31e1634 0x7f65d31e16f2 0x7f65d319879e 0x502d6f 0x506859 0x502209
2019-06-15 06:46:35.063397: W tensorflow/core/framework/allocator.cc:107] Allocation of 4294967296 exceeds 10% of system memory.
tcmalloc: large alloc 4294967296 bytes == 0x193eae000 @ 0x7f65e5892b6b 0x7f65e58b2379 0x7f65d08d0437 0x7f65d08804bf 0x7f65d05828d9 0x7f65d058a6ff 0x7f65d784dbe2 0x7f65d784fa3e 0x7f65d784fc37 0x7f65d7848375 0x7f65d77ec2e1 0x7f65d77ed495 0x7f65d76e9cec 0x7f65d76eaec5 0x7f65d76ec880 0x7f65d76ef18f 0x7f65d76e0c89 0x7f65d76e2d1b 0x7f65d528773a 0x7f65d5288dc4 0x7f65d528af41 0x7f65d528c768 0x7f65d31e45fd 0x7f65d52c3e2d 0x7f65d52c4b0c 0x7f65d31e1634 0x7f65d31e16f2 0x7f65d319879e 0x502d6f 0x506859 0x502209
2019-06-15 06:46:36.083030: W tensorflow/core/framework/allocator.cc:107] Allocation of 4294967296 exceeds 10% of system memory.
tcmalloc: large alloc 4294967296 bytes == 0x193eae000 @ 0x7f65e5892b6b 0x7f65e58b2379 0x7f65d08d0437 0x7f65d08804bf 0x7f65d05828d9 0x7f65d058a6ff 0x7f65d784dbe2 0x7f65d784fa3e 0x7f65d784fc37 0x7f65d7848375 0x7f65d77ec2e1 0x7f65d77ed495 0x7f65d76e9cec 0x7f65d76eaec5 0x7f65d76ec880 0x7f65d76ef18f 0x7f65d76e0c89 0x7f65d76e2d1b 0x7f65d528773a 0x7f65d5288dc4 0x7f65d528af41 0x7f65d528c768 0x7f65d31e45fd 0x7f65d52c3e2d 0x7f65d52c4b0c 0x7f65d31e1634 0x7f65d31e16f2 0x7f65d319879e 0x502d6f 0x506859 0x502209
2019-06-15 06:46:37.083019: W tensorflow/core/framework/allocator.cc:107] Allocation of 4294967296 exceeds 10% of system memory.
tcmalloc: large alloc 4294967296 bytes == 0x193eae000 @ 0x7f65e5892b6b 0x7f65e58b2379 0x7f65d08d0437 0x7f65d08804bf 0x7f65d05828d9 0x7f65d058a6ff 0x7f65d784dbe2 0x7f65d784fa3e 0x7f65d784fc37 0x7f65d7848375 0x7f65d77ec2e1 0x7f65d77ed495 0x7f65d76e9cec 0x7f65d76eaec5 0x7f65d76ec880 0x7f65d76ef18f 0x7f65d76e0c89 0x7f65d76e2d1b 0x7f65d528773a 0x7f65d5288dc4 0x7f65d528af41 0x7f65d528c768 0x7f65d31e45fd 0x7f65d52c3e2d 0x7f65d52c4b0c 0x7f65d31e1634 0x7f65d31e16f2 0x7f65d319879e 0x502d6f 0x506859 0x502209
2019-06-15 06:46:38.082609: W tensorflow/core/framework/allocator.cc:107] Allocation of 4294967296 exceeds 10% of system memory.
tcmalloc: large alloc 4294967296 bytes == 0x193eae000 @ 0x7f65e5892b6b 0x7f65e58b2379 0x7f65d08d0437 0x7f65d08804bf 0x7f65d05828d9 0x7f65d058a6ff 0x7f65d784dbe2 0x7f65d784fa3e 0x7f65d784fc37 0x7f65d7848375 0x7f65d77ec2e1 0x7f65d77ed495 0x7f65d76e9cec 0x7f65d76eaec5 0x7f65d76ec880 0x7f65d76ef18f 0x7f65d76e0c89 0x7f65d76e2d1b 0x7f65d528773a 0x7f65d5288dc4 0x7f65d528af41 0x7f65d528c768 0x7f65d31e45fd 0x7f65d52c3e2d 0x7f65d52c4b0c 0x7f65d31e1634 0x7f65d31e16f2 0x7f65d319879e 0x502d6f 0x506859 0x502209
tcmalloc: large alloc 4294967296 bytes == 0x193eae000 @ 0x7f65e5892b6b 0x7f65e58b2379 0x7f65d08d0437 0x7f65d08804bf 0x7f65d05828d9 0x7f65d058a6ff 0x7f65d784dbe2 0x7f65d784fa3e 0x7f65d784fc37 0x7f65d7848375 0x7f65d77ec2e1 0x7f65d77ed495 0x7f65d76e9cec 0x7f65d76eaec5 0x7f65d76ec880 0x7f65d76ef18f 0x7f65d76e0c89 0x7f65d76e2d1b 0x7f65d528773a 0x7f65d5288dc4 0x7f65d528af41 0x7f65d528c768 0x7f65d31e45fd 0x7f65d52c3e2d 0x7f65d52c4b0c 0x7f65d31e1634 0x7f65d31e16f2 0x7f65d319879e 0x502d6f 0x506859 0x502209
tcmalloc: large alloc 4294967296 bytes == 0x193eae000 @ 0x7f65e5892b6b 0x7f65e58b2379 0x7f65d08d0437 0x7f65d08804bf 0x7f65d05828d9 0x7f65d058a6ff 0x7f65d784dbe2 0x7f65d784fa3e 0x7f65d784fc37 0x7f65d7848375 0x7f65d77ec2e1 0x7f65d77ed495 0x7f65d76e9cec 0x7f65d76eaec5 0x7f65d76ec880 0x7f65d76ef18f 0x7f65d76e0c89 0x7f65d76e2d1b 0x7f65d528773a 0x7f65d5288dc4 0x7f65d528af41 0x7f65d528c768 0x7f65d31e45fd 0x7f65d52c3e2d 0x7f65d52c4b0c 0x7f65d31e1634 0x7f65d31e16f2 0x7f65d319879e 0x502d6f 0x506859 0x502209
tcmalloc: large alloc 4294967296 bytes == 0x193eae000 @ 0x7f65e5892b6b 0x7f65e58b2379 0x7f65d08d0437 0x7f65d08804bf 0x7f65d05828d9 0x7f65d058a6ff 0x7f65d784dbe2 0x7f65d784fa3e 0x7f65d784fc37 0x7f65d7848375 0x7f65d77ec2e1 0x7f65d77ed495 0x7f65d76e9cec 0x7f65d76eaec5 0x7f65d76ec880 0x7f65d76ef18f 0x7f65d76e0c89 0x7f65d76e2d1b 0x7f65d528773a 0x7f65d5288dc4 0x7f65d528af41 0x7f65d528c768 0x7f65d31e45fd 0x7f65d52c3e2d 0x7f65d52c4b0c 0x7f65d31e1634 0x7f65d31e16f2 0x7f65d319879e 0x502d6f 0x506859 0x502209
tcmalloc: large alloc 4294967296 bytes == 0x193eae000 @ 0x7f65e5892b6b 0x7f65e58b2379 0x7f65d08d0437 0x7f65d08804bf 0x7f65d05828d9 0x7f65d058a6ff 0x7f65d784dbe2 0x7f65d784fa3e 0x7f65d784fc37 0x7f65d7848375 0x7f65d77ec2e1 0x7f65d77ed495 0x7f65d76e9cec 0x7f65d76eaec5 0x7f65d76ec880 0x7f65d76ef18f 0x7f65d76e0c89 0x7f65d76e2d1b 0x7f65d528773a 0x7f65d5288dc4 0x7f65d528af41 0x7f65d528c768 0x7f65d31e45fd 0x7f65d52c3e2d 0x7f65d52c4b0c 0x7f65d31e1634 0x7f65d31e16f2 0x7f65d319879e 0x502d6f 0x506859 0x502209
tcmalloc: large alloc 4294967296 bytes == 0x193eae000 @ 0x7f65e5892b6b 0x7f65e58b2379 0x7f65d08d0437 0x7f65d08804bf 0x7f65d05828d9 0x7f65d058a6ff 0x7f65d784dbe2 0x7f65d784fa3e 0x7f65d784fc37 0x7f65d7848375 0x7f65d77ec2e1 0x7f65d77ed495 0x7f65d76e9cec 0x7f65d76eaec5 0x7f65d76ec880 0x7f65d76ef18f 0x7f65d76e0c89 0x7f65d76e2d1b 0x7f65d528773a 0x7f65d5288dc4 0x7f65d528af41 0x7f65d528c768 0x7f65d31e45fd 0x7f65d52c3e2d 0x7f65d52c4b0c 0x7f65d31e1634 0x7f65d31e16f2 0x7f65d319879e 0x502d6f 0x506859 0x502209
tcmalloc: large alloc 4294967296 bytes == 0x193eae000 @ 0x7f65e5892b6b 0x7f65e58b2379 0x7f65d08d0437 0x7f65d08804bf 0x7f65d05828d9 0x7f65d058a6ff 0x7f65d784dbe2 0x7f65d784fa3e 0x7f65d784fc37 0x7f65d7848375 0x7f65d77ec2e1 0x7f65d77ed495 0x7f65d76e9cec 0x7f65d76eaec5 0x7f65d76ec880 0x7f65d76ef18f 0x7f65d76e0c89 0x7f65d76e2d1b 0x7f65d528773a 0x7f65d5288dc4 0x7f65d528af41 0x7f65d528c768 0x7f65d31e45fd 0x7f65d52c3e2d 0x7f65d52c4b0c 0x7f65d31e1634 0x7f65d31e16f2 0x7f65d319879e 0x502d6f 0x506859 0x502209
tcmalloc: large alloc 4294967296 bytes == 0x193eae000 @ 0x7f65e5892b6b 0x7f65e58b2379 0x7f65d08d0437 0x7f65d08804bf 0x7f65d05828d9 0x7f65d058a6ff 0x7f65d784dbe2 0x7f65d784fa3e 0x7f65d784fc37 0x7f65d7848375 0x7f65d77ec2e1 0x7f65d77ed495 0x7f65d76e9cec 0x7f65d76eaec5 0x7f65d76ec880 0x7f65d76ef18f 0x7f65d76e0c89 0x7f65d76e2d1b 0x7f65d528773a 0x7f65d5288dc4 0x7f65d528af41 0x7f65d528c768 0x7f65d31e45fd 0x7f65d52c3e2d 0x7f65d52c4b0c 0x7f65d31e1634 0x7f65d31e16f2 0x7f65d319879e 0x502d6f 0x506859 0x502209
tcmalloc: large alloc 4294967296 bytes == 0x193eae000 @ 0x7f65e5892b6b 0x7f65e58b2379 0x7f65d08d0437 0x7f65d08804bf 0x7f65d05828d9 0x7f65d058a6ff 0x7f65d784dbe2 0x7f65d784fa3e 0x7f65d784fc37 0x7f65d7848375 0x7f65d77ec2e1 0x7f65d77ed495 0x7f65d76e9cec 0x7f65d76eaec5 0x7f65d76ec880 0x7f65d76ef18f 0x7f65d76e0c89 0x7f65d76e2d1b 0x7f65d528773a 0x7f65d5288dc4 0x7f65d528af41 0x7f65d528c768 0x7f65d31e45fd 0x7f65d52c3e2d 0x7f65d52c4b0c 0x7f65d31e1634 0x7f65d31e16f2 0x7f65d319879e 0x502d6f 0x506859 0x502209
tcmalloc: large alloc 17179869184 bytes == 0x394786000 @ 0x7f65e5892b6b 0x7f65e58b2379 0x7f65d08d0437 0x7f65d08804bf 0x7f65d0588b2b 0x7f65d0556736 0x7f65d0556c27 0x7f65d0556cf8 0x7f65d7251cab 0x7f65d7255db7 0x7f65d07f54bb 0x7f65d07e7995 0x7f65d089ee99 0x7f65d089bd78 0x7f65e419266f 0x7f65e52746db 0x7f65e55ad88f
^C

It looks the same today. I'm using Google Colaboratory.

@ConnorJL
Owner

Currently this repo is a bit of a mess and fine-tuning is not as user-friendly as I'd like it to be. I hope to improve things at some point, but for now you'll have to tweak a lot of code by hand.

The bug you posted is strange; I've never seen it before. From what I can read in the logs, I'd guess that either the CPU is too slow or you are running out of RAM.
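
To put rough numbers on the RAM guess: the two allocation sizes in the logs are exact powers of two, and they coincide with the sizes of float32 tensors implied by the posted config. The shapes below are assumptions about a standard GPT-2 layout (an MLP hidden layer of 4 * n_embd, and attention scores of shape [batch, heads, n_ctx, n_ctx]), not something read from this repo's code, so treat this as a back-of-the-envelope sketch.

```python
# Values from the JSON config posted earlier in this thread.
train_batch_size = 256
n_ctx = 1024
n_embd = 1024
n_head = 16

BYTES_PER_FLOAT32 = 4  # assumption: activations are stored as float32

# Assumed shape [batch, n_ctx, 4 * n_embd]: one MLP hidden activation.
mlp_hidden_bytes = train_batch_size * n_ctx * 4 * n_embd * BYTES_PER_FLOAT32

# Assumed shape [batch, n_head, n_ctx, n_ctx]: one layer's attention scores.
attn_scores_bytes = train_batch_size * n_head * n_ctx * n_ctx * BYTES_PER_FLOAT32

GIB = 2 ** 30
print(mlp_hidden_bytes, mlp_hidden_bytes / GIB)    # 4294967296 4.0
print(attn_scores_bytes, attn_scores_bytes / GIB)  # 17179869184 16.0
```

If these shapes are right, a single tensor at batch size 256 already needs 4 to 16 GiB, so lowering `train_batch_size` in the JSON would be a natural first experiment on a CPU runtime.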

@wjy979769265
Author

Thank you very much!
