How are you supposed to use Idx and ThreadId in FitLoading? #120

I guess these are used to select the pair, but how? Suppose I have a CSV file with 100.000.000 rows (each row consisting of 100 inputs and 1 output). Could you provide some guidance on how to process this using FitLoading?

Comments
Regarding "1 output", is this an integer (such as a class id) or is it a floating point value?
To be precise, the records consist of 192 binary inputs (0 1 1 1 0 ...) and one floating point output value.
I have hard-coded the number of threads to 1 in NeuralDefaultThreadCount and printed Idx in GetTrainingPair to get an idea of what is going on. It seems that Idx refers to the position within the batch, as it runs from 1..32 all the time. But what are you supposed to return? I have 100.000.000 records, so now what?
I have been digging into the code, and in TNeuralDataLoadingFit.RunNNThread it says: BlockSize := FBatchSize div FThreadNum;
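For what it's worth, here is a minimal sketch of what the callback is expected to do, following the shape of the HypotenuseFitLoading example mentioned later in this thread; the record-selection and unpacking helpers are hypothetical, not part of the library. Each call fills pInput and pOutput with exactly one training pair, and Idx only identifies the position inside the current batch (so with a batch size of 32 and FThreadNum = 2, each thread handles BlockSize = 32 div 2 = 16 pairs); which of the 100.000.000 records to serve is left to your own bookkeeping:

procedure TTestFitLoading.GetTrainingPair(Idx: integer; ThreadId: integer;
  pInput, pOutput: TNNetVolume);
var
  i, RecordIdx: integer;
begin
  // Pick the record yourself; Idx only tells you the position in the batch.
  RecordIdx := PickNextRecordIndex(ThreadId);       // hypothetical helper
  pInput.ReSize(192, 1, 1);                         // 192 binary inputs
  for i := 0 to 191 do
    pInput.FData[i] := GetInputBit(RecordIdx, i);   // hypothetical helper
  pOutput.ReSize(1, 1, 1);                          // one floating point output
  pOutput.FData[0] := GetOutputValue(RecordIdx);    // hypothetical helper
end;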
I have increased the number of threads to 2, and based on the output it seems that the calls to TTestFitLoading.GetTestPair run in parallel, as the 'writeln' output sometimes gets garbled. So I could add a global variable nPair that I increment using InterlockedIncrement to keep track of where I am, but when should I set the variable back to 0 to start all over again? Or should I use % 100.000.000?
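A minimal sketch of that shared-counter idea, assuming the Free Pascal RTL's InterLockedIncrement and a hypothetical TotalRecords constant; wrapping with mod answers the "set it back to 0" question without any explicit reset:

var
  nPair: longint = -1;   // shared by all threads

function NextRecordIndex: longint;
const
  TotalRecords = 100000000;
begin
  // InterLockedIncrement updates nPair atomically and returns the new value,
  // so every call gets a unique number even when threads run in parallel.
  // The mod wraps back to record 0 after the last record. A longint counter
  // overflows after about 2.1 billion calls, so reset it between epochs
  // (or switch to a 64-bit interlocked increment) for very long runs.
  Result := InterLockedIncrement(nPair) mod TotalRecords;
end;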
"So that explains why Idx runs from 1..32 all the time. So I guess I can ignore Idx" "records somewhere so that the next time TTestFitLoading.GetTestPair is called I return the next record?" Idea: divide your 100M CSV into smaller files (1 per planned thread). Then, you can have one file handler per threadid: handlers: array[0..7] of TextFile. Then, you can read rows with ReadLn(Handlers[ThreadId]). This is an idea only. I'm not saying that this is a good idea nor its a recommendation. One file handler per thread is thread safe. If you have one handler per thread, you won't need to create your own record pointers nor any thread synch. Other idea: keep only one 100M rows CSV file with N file handlers for each thread at different points of the same file. Thinking well, this is better than dividing the file. |
Thanks. Last question: I guess training requires multiple passes over the 100.000.000 records. How do I know when to start again from the beginning? Or should I just check for EOF and then close and reopen the CSV file? I could also bit-pack all records so they would fit in memory, and GetTestPair would unpack the next record. In that case too: how do I know when to start again from the beginning?
Feel free to keep asking; your questions are valid. If you decide to fit the dataset in memory (I know of people buying notebooks with 128 GB of RAM just to fit their data), you can load the data randomly. The randomness actually helps the learning. The other option is running your models in Google Colab using high-RAM machines. This is an example for running Pascal code in Colab:
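A minimal sketch of the in-memory option with random sampling, assuming hypothetical arrays Inputs (the 192 bits of each record packed into 24 bytes) and Outputs (one value per record), and the callback shape from the HypotenuseFitLoading example. Picking a random record on every call also sidesteps the "when do I start over" question, because there is no pass boundary:

const
  TotalRecords   = 100000000;
  BytesPerRecord = 24;              // 192 bits packed into 24 bytes

var
  Inputs: array of byte;            // TotalRecords * BytesPerRecord bytes
  Outputs: array of single;         // one output value per record

procedure TTestFitLoading.GetTrainingPair(Idx: integer; ThreadId: integer;
  pInput, pOutput: TNNetVolume);
var
  RecordIdx, Bit: integer;
  Base: int64;
begin
  RecordIdx := Random(TotalRecords);          // random record on every call
  Base := int64(RecordIdx) * BytesPerRecord;  // 64-bit offset to avoid overflow
  pInput.ReSize(192, 1, 1);
  for Bit := 0 to 191 do
    // Unpack one input bit from the packed byte array.
    pInput.FData[Bit] := (Inputs[Base + Bit div 8] shr (Bit mod 8)) and 1;
  pOutput.ReSize(1, 1, 1);
  pOutput.FData[0] := Outputs[RecordIdx];
end;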
I have modified the HypotenuseFitLoading example to support my CSV file as discussed, and I already have it working, including loading the CSV completely bit-packed in memory. That also means that within one day with Neural-API I have achieved what I could not achieve with TensorFlow in, say, two weeks (although I only worked on it in my spare time over a couple of months)! I will clean up the code a bit; perhaps we can add my project to the Examples section.
@gwiesenekker, thank you very much for your kind comments!!! If you add your code to GitHub or another public repository, I'll be glad to add a link to it in my main readme. If you have trained NNs on public datasets, I can also add a link to the trained NN.