What algorithm is used for BALANCED?

What algorithm is used for BALANCED? #478347 10/07/19 16:58
JamesHH OP Junior Member, Joined: Oct 2018, Posts: 72

The documentation for adviseLong (Machine Learning) states:

"+BALANCED - enforce the same number of positive and negative target values by replication ..."

and also in the Remarks:

"... negative and positive Objective values should be equally distributed. If in doubt, add +BALANCED to the method; this will simply copy samples until balance is reached."

This sounds like standard upsampling. However, a simple experiment shows that this is not the case. When I run one of the example scripts with the SIGNALS method and without BALANCED, the generated data has 58061 samples with the following objective stats:

Objective   -1      0     1
Samples     28769   587   28705

Assuming that 0 counts as negative (which seems to be the case, yes?), the imbalance is 29356 - 28705 = 651 more negatives than positives.

Now I run the exact same script but with +BALANCED, and the data has 76241 samples with objective stats:

Objective   -1      0     1
Samples     37257   864   38120

So it is balanced, with 38121 negatives versus 38120 positives. This means that 76241 - 58061 = 18180 samples were added, whereas upsampling would only have added 651 samples.

What is going on? This is critical for training ML models, and I can give a specific example if anyone wants.
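The arithmetic in this post can be checked with a few lines of Python. This is only a sketch: the counts are copied from the post, and the names `without_balanced`, `with_balanced`, and `summarize` are made up for illustration. Treating 0 as negative is the poster's assumption, not a documented rule.

```python
# Objective counts as reported in the post (label -> number of samples).
without_balanced = {-1: 28769, 0: 587, 1: 28705}
with_balanced = {-1: 37257, 0: 864, 1: 38120}


def summarize(counts):
    """Return (total, negatives, positives), counting 0 as negative (assumption)."""
    negatives = counts[-1] + counts[0]
    positives = counts[1]
    return negatives + positives, negatives, positives


total_before, neg_before, pos_before = summarize(without_balanced)
total_after, neg_after, pos_after = summarize(with_balanced)

imbalance_before = neg_before - pos_before   # surplus of negatives without BALANCED
imbalance_after = neg_after - pos_after      # surplus of negatives with BALANCED
samples_added = total_after - total_before   # samples replicated by BALANCED

print(total_before, imbalance_before)
print(total_after, imbalance_after)
print(samples_added)
```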

Re: What algorithm is used for BALANCED? [Re: JamesHH] #478351 10/08/19 10:30
jcl Chief Engineer, Joined: Jul 2000, Posts: 27,978, Frankfurt

No, I think you'll need more than 651 samples for upsampling. It is not sufficient to just add the difference. That would balance them globally, but not locally. Samples should be balanced at any point, while still maintaining their order. This is important when you later split them or process them in batches.

I don't know the details, but suppose the algorithm just runs over the samples and duplicates the last positive whenever there are more negatives, and vice versa. You can see that this adds a lot more than 651 samples.

Re: What algorithm is used for BALANCED? [Re: jcl] #478352 10/08/19 16:23
JamesHH OP Junior Member, Joined: Oct 2018, Posts: 72

I see your point about maintaining the ordering. However, it does not simply duplicate the last positive (when there are more negatives), or else only 651 samples would have been added above, yes?

It looks like each sample is duplicated a random number of times, both positive and negative. I am guessing that this random number is distributed in such a way that the resulting sequence is approximately balanced (though it seems more exact than this). I wonder why an algorithm that repeats both positive and negative samples was chosen?

Re: What algorithm is used for BALANCED? [Re: JamesHH] #478355 10/09/19 00:24
JamesHH OP Junior Member, Joined: Oct 2018, Posts: 72

I reread your reply, and now I see that it must be "balanced at every point". I'm not exactly sure how that would be mathematically defined, but perhaps it means balancing over a sliding window? But then there probably should be a window size parameter.

Re: What algorithm is used for BALANCED? [Re: JamesHH] #478357 10/09/19 08:11
jcl Chief Engineer, Joined: Jul 2000, Posts: 27,978, Frankfurt

"Any point" means that any sub-range of the samples is still balanced. Random numbers or sliding windows are not used; the algorithm is just as simple as described above.

Re: What algorithm is used for BALANCED? [Re: jcl] #478374 10/09/19 17:26
JamesHH OP Junior Member, Joined: Oct 2018, Posts: 72

Originally Posted by jcl: "Any point" means that any sub-range of the samples is still balanced.

But that begs the question of what a "sub-range" is. It can't possibly be balanced on every interval.

Originally Posted by jcl: Random numbers or sliding windows are not used, the algorithm is just as simple as described above.

Yes, I reverse-engineered the algorithm (or at least I reproduced BALANCED on two different datasets): at any point, it looks at the initial segment of the time series so far, including duplicates. If there is an imbalance greater than 1, and if repeating the current sample reduces that imbalance, then it duplicates the current sample. There is a limit of three duplicates for each sample.

What was surprising is that my standard feed-forward NN was barely able to learn on data that was only slightly imbalanced, something like 52%-48%. Yet it learned so well after this simple balancing algorithm. Everything I have read so far about ML seems to focus on the case of extreme imbalances. Also, I looked up quite a few balancing techniques and none of them was anything like the one in Zorro.
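One possible reading of the reverse-engineered rule, sketched in Python. The post leaves several details open, so the following are assumptions rather than confirmed behavior: the imbalance is measured after the current sample has been appended, "three duplicates" means up to three extra copies of a sample, and 0 counts as negative. The name `balance_prefix` is made up for illustration.

```python
def balance_prefix(labels, max_duplicates=3):
    """Order-preserving balancing on the running prefix (including duplicates).
    A sample is repeated while the prefix imbalance exceeds 1 and another copy
    would shrink it, up to max_duplicates extra copies per sample."""
    out = []
    balance = 0  # positives minus negatives in the output so far
    for y in labels:
        step = 1 if y > 0 else -1  # labels <= 0 count as negative (assumption)
        out.append(y)
        balance += step
        copies = 0
        while (copies < max_duplicates
               and abs(balance) > 1
               and abs(balance + step) < abs(balance)):
            out.append(y)
            balance += step
            copies += 1
    return out


# Four negatives followed by two positives: the first positive is repeated
# until the prefix imbalance is no longer greater than 1.
print(balance_prefix([-1, -1, -1, -1, 1, 1]))

# The per-sample cap stops a single positive from closing a large gap alone.
print(balance_prefix([-1] * 10 + [1]).count(1))
```

Under this reading, both classes can be duplicated over the course of a long series, since whichever class is behind at a given point gets repeated. That would be consistent with the observation that far more than 651 samples were added.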

Re: What algorithm is used for BALANCED? [Re: JamesHH] #478380 10/10/19 11:34
jcl Chief Engineer, Joined: Jul 2000, Posts: 27,978, Frankfurt

If a NN did not work well with slightly imbalanced data, but worked better with the Zorro balance algorithm, then it probably uses mini-batch gradient descent for learning. Mini-batch requires that not only the whole set but also small intervals of the samples are more or less balanced.
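To see what "small intervals balanced" means for mini-batches, one can measure the share of positive samples in each consecutive batch of the unshuffled data. A minimal sketch: the helper name `batch_positive_fractions` is hypothetical, and labels greater than 0 are taken as positive.

```python
def batch_positive_fractions(labels, batch_size):
    """Fraction of positive labels in each consecutive, unshuffled mini-batch."""
    fractions = []
    for start in range(0, len(labels), batch_size):
        batch = labels[start:start + batch_size]
        fractions.append(sum(1 for y in batch if y > 0) / len(batch))
    return fractions


# Globally balanced (3 vs 3), but the first two batches are one-sided.
print(batch_positive_fractions([1, 1, -1, -1, 1, -1], batch_size=2))
```

Fractions close to 0.5 in every batch indicate local balance. A set can be balanced overall and still produce batches far from 0.5 when the data is not shuffled.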

Re: What algorithm is used for BALANCED? [Re: JamesHH] #478391 10/11/19 17:32
JamesHH OP Junior Member, Joined: Oct 2018, Posts: 72

Yes, it does use mini-batch gradient descent, which I believe is by far the most common NN setup. However, I was also shuffling before each epoch. So standard random upsampling (in my case randomly sampling 651 samples from the minority class) is just as effective for balancing both locally and globally. And the OOS results were comparable using either standard upsampling or Zorro's balancing algorithm.

The strange thing is that when training with Zorro's algorithm there was very large overfitting on the training set, but no sign of overfitting with the usual upsampling method. With unshuffled data, the NN could reverse-engineer Zorro's algorithm just as I did, which is not desirable. I'm not sure what is going on with the shuffled data, but the usual upsampling would seem to be better based on this experiment.
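The "standard upsampling" mentioned here can be sketched as random sampling with replacement from the minority class, followed by a shuffle. This is a generic illustration, not the poster's actual training code. The function name `random_upsample`, the `is_positive` callback, and the fixed seed are assumptions.

```python
import random


def random_upsample(samples, is_positive, seed=0):
    """Draw random minority-class samples (with replacement) until both
    classes have the same size, then shuffle the combined set."""
    rng = random.Random(seed)
    positives = [s for s in samples if is_positive(s)]
    negatives = [s for s in samples if not is_positive(s)]
    if len(positives) < len(negatives):
        minority, majority = positives, negatives
    else:
        minority, majority = negatives, positives
    deficit = len(majority) - len(minority)
    # Only the global difference is added, e.g. 651 samples in the thread's data.
    extra = [rng.choice(minority) for _ in range(deficit)] if minority else []
    balanced = list(samples) + extra
    rng.shuffle(balanced)  # shuffling spreads the added samples across batches
    return balanced


labels = [-1] * 6 + [0] + [1] * 4   # 7 negatives (0 counted as negative), 4 positives
result = random_upsample(labels, lambda y: y > 0)
print(len(result), sum(1 for y in result if y > 0))
```

Unlike the order-preserving scheme, this approach destroys the time ordering, so it only makes sense when the training procedure shuffles the samples anyway.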
