What algorithm is used for BALANCED?

Posted By: JamesHH

What algorithm is used for BALANCED? - 10/07/19 16:58

The documentation for adviseLong (Machine Learning) states:

"+BALANCED - enforce the same number of positive and negative target values by replication ..." and also in the Remarks:

"... negative and positive Objective values should be equally distributed. If in doubt, add +BALANCED to the method; this will simply copy samples until balance is reached."

This sounds like standard upsampling. However, a simple experiment shows that this is not the case: when I run one of the example scripts with the SIGNALS method and without BALANCED, the generated data has 58061 samples with the following objective stats:

Objective   -1      0     1
Count       28769   587   28705

Assuming that 0 counts as negative (which seems to be the case, yes?), the imbalance is 29356 - 28705 = 651 more negatives than positives. Now, I run the exact same script but with +BALANCED, and the data now has 76241 samples with objective stats:

Objective   -1      0     1
Count       37257   864   38120

So it is balanced, with 38121 negatives versus 38120 positives.

This means that 76241 - 58061 = 18180 samples were added, whereas upsampling would only have added 651 samples.

What is going on? This is *critical* for training ML models, and I can give a specific example if anyone wants.
Posted By: jcl

Re: What algorithm is used for BALANCED? - 10/08/19 10:30

No, I think you'll need more than 651 samples for upsampling. It is not sufficient to just add the difference. That would balance them globally, but not locally. Samples should be balanced at any point, while still maintaining their order. This is important when you later split them or process them in batches.

I don't know the details, but suppose the algorithm just runs over the samples and duplicates the last positive whenever there are more negatives, and vice versa. You can see that this adds a lot more than 651 samples.
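
In Python that kind of scheme would look roughly like this (an illustration of the idea only, a guess, not the actual Zorro code):

def balance_by_duplication(labels):
    # Walk the labels in order; whenever the running counts drift apart,
    # append copies of the most recent sample of the under-represented
    # class until the running counts match again.
    out = []                    # balanced sequence, original order preserved
    pos = neg = 0               # class counts over the output built so far
    last_pos = last_neg = None  # most recent positive / negative value seen
    for y in labels:
        out.append(y)
        if y > 0:
            pos += 1
            last_pos = y
        else:                   # 0 counted as negative, as in the thread
            neg += 1
            last_neg = y
        while pos < neg and last_pos is not None:
            out.append(last_pos)
            pos += 1
        while neg < pos and last_neg is not None:
            out.append(last_neg)
            neg += 1
    return out

print(len(balance_by_duplication([-1, -1, 1, -1, 1])))  # 8 samples from 5

Even on that short sequence it inserts three copies although the global difference is only one.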
Posted By: JamesHH

Re: What algorithm is used for BALANCED? - 10/08/19 16:23

I see your point about maintaining the ordering.

However, it does not simply duplicate the last positive (when there are more negatives), or else only 651 samples would have been added above, yes?

It looks like each sample is duplicated a random number of times, both positive and negative. I am guessing that this random number is distributed in such a way that the resulting sequence is approximately balanced (though it seems more exact than this).

I wonder why an algorithm that repeats both positive and negative samples was chosen?
Posted By: JamesHH

Re: What algorithm is used for BALANCED? - 10/09/19 00:24

I reread your reply, and now I see that it must be "balanced at every point". I'm not exactly sure how that would be defined mathematically, but perhaps it means balancing over a sliding window? But then there would probably need to be a window size parameter.
Posted By: jcl

Re: What algorithm is used for BALANCED? - 10/09/19 08:11

"Any point" means that any sub-range of the samples is still balanced. Random numbers or sliding windows are not used, the algorithm is just as simple as described above.
Posted By: JamesHH

Re: What algorithm is used for BALANCED? - 10/09/19 17:26

Originally Posted by jcl
"Any point" means that any sub-range of the samples is still balanced.


But that raises the question of what a "sub-range" is. It can't possibly be balanced on every interval.

Quote
Random numbers or sliding windows are not used, the algorithm is just as simple as described above.


Yes, I reverse-engineered the algorithm (or at least I reproduced BALANCED on two different datasets): at any point, it looks at the *initial segment* of the time series so far, including duplicates. If that segment has an imbalance greater than 1, and repeating the current sample would reduce it, then the current sample is duplicated, with a limit of three duplicates per sample.
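
Here is a small Python sketch of that rule as I reconstructed it (my own code, not Zorro's source; 0 is counted as negative as before):

MAX_COPIES = 3   # observed limit of three duplicates per sample

def balance_prefix(labels):
    # Keep class counts over the output built so far (the "initial segment
    # including duplicates"); whenever that prefix is off by more than 1 and
    # repeating the current sample would shrink the gap, duplicate it,
    # at most MAX_COPIES times.
    out = []
    pos = neg = 0
    for y in labels:
        out.append(y)
        if y > 0:
            pos += 1
        else:
            neg += 1
        copies = 0
        while copies < MAX_COPIES:
            if y > 0 and neg - pos > 1:
                out.append(y)
                pos += 1
            elif y <= 0 and pos - neg > 1:
                out.append(y)
                neg += 1
            else:
                break
            copies += 1
    return out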

What was surprising is that my standard feed-forward NN was barely able to learn on data that was only slightly imbalanced, something like 52%-48%, yet it learned well after this simple balancing algorithm was applied. Everything I have read so far about ML seems to focus on the case of extreme imbalances. Also, I looked up quite a few balancing techniques, and none of them was anything like the one in Zorro.
Posted By: jcl

Re: What algorithm is used for BALANCED? - 10/10/19 11:34

If a NN did not work well with slightly imbalanced data, but worked better with the Zorro balance algorithm, then it probably uses mini-batch gradient descent for learning. Mini-batch requires that not only the whole set, but also small intervals of the samples are still more or less balanced.
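
A quick Python illustration with made-up numbers: a label sequence that is about 52/48 overall, so almost balanced globally, but clustered in time, so unshuffled mini-batches are completely one-sided:

import numpy as np

# ~52% positive overall, but all positives come first.
labels = np.concatenate([np.ones(520), -np.ones(480)])
batch_size = 32
fractions = [float((labels[i:i + batch_size] > 0).mean())
             for i in range(0, len(labels), batch_size)]
print(fractions[:3], fractions[-3:])   # early batches 1.0, late batches 0.0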
Posted By: JamesHH

Re: What algorithm is used for BALANCED? - 10/11/19 17:32

Yes, it does use mini-batch gradient descent, which I believe is by far the most common NN setup.

However, I was also shuffling before each epoch, so a standard random upsampling (in my case, randomly sampling 651 extra samples from the minority class) is just as effective for balancing both locally and globally. And the OOS results were comparable using either standard upsampling or Zorro's balancing algorithm.
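
For reference, this is roughly what I mean by standard random upsampling plus per-epoch shuffling (a Python sketch; the variable names are mine):

import numpy as np

def upsample_and_shuffle(X, y, rng):
    # Draw random duplicates from the minority class until the class counts
    # match, then shuffle everything so the duplicates are spread across
    # all mini-batches.
    pos_idx = np.where(y > 0)[0]
    neg_idx = np.where(y <= 0)[0]          # 0 counted as negative, as above
    if len(pos_idx) < len(neg_idx):
        minority, majority = pos_idx, neg_idx
    else:
        minority, majority = neg_idx, pos_idx
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([np.arange(len(y)), extra])
    rng.shuffle(idx)
    return X[idx], y[idx]

# rng = np.random.default_rng(42)
# X_bal, y_bal = upsample_and_shuffle(X_train, y_train, rng)  # X_train, y_train: features and -1/0/1 targets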

The strange thing is that when training with Zorro's algorithm there was severe overfitting on the training set, but no sign of overfitting with the usual upsampling method. With unshuffled data, the NN could reverse-engineer Zorro's algorithm just as I did, which is not desirable. I'm not sure what is going on with the shuffled data, but based on this experiment the usual upsampling seems to be the better choice.