What algorithm is used for BALANCED? #478347
10/07/19 16:58
JamesHH (OP, Junior Member; joined Oct 2018, 72 posts)

The documentation for adviseLong (Machine Learning) states:

"+BALANCED - enforce the same number of positive and negative target values by replication ..." and also in the Remarks:

"... negative and positive Objective values should be equally distributed. If in doubt, add +BALANCED to the method; this will simply copy samples until balance is reached."

This sounds like standard upsampling. However, a simple experiment shows that this is not the case. When I run one of the example scripts with the SIGNALS method and without BALANCED, the generated data has 58061 samples with the following objective stats:

Objective:      -1      0      1
Samples:     28769    587  28705

Assuming that 0 counts as negative (which seems to be the case, yes?), the imbalance is 29356 - 28705 = 651 more negatives than positives. Now, I run the exact same script but with +BALANCED, and the data now has 76241 samples with objective stats:

Objective:      -1      0      1
Samples:     37257    864  38120

So the data is now balanced, with 38121 negatives versus 38120 positives.

This means that 76241 - 58061 = 18180 samples were added, whereas upsampling would only have added 651 samples.
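
For anyone who wants to check the arithmetic, here is a minimal Python sketch. The counts are the ones reported above; the dictionary and function names are only illustrative, and treating 0 as negative is the same assumption made in the post.

```python
# Objective counts reported in the post, keyed by objective value.
without_balanced = {-1: 28769, 0: 587, 1: 28705}
with_balanced = {-1: 37257, 0: 864, 1: 38120}


def imbalance(counts):
    """Negatives minus positives, assuming an objective of 0 counts as negative."""
    negatives = sum(n for value, n in counts.items() if value <= 0)
    positives = sum(n for value, n in counts.items() if value > 0)
    return negatives - positives


added_by_balanced = sum(with_balanced.values()) - sum(without_balanced.values())

print(imbalance(without_balanced))  # 651
print(imbalance(with_balanced))     # 1
print(added_by_balanced)            # 18180
```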

What is going on? This is *critical* for training ML models, and I can give a specific example if anyone wants.

Re: What algorithm is used for BALANCED? [Re: JamesHH] #478351
10/08/19 10:30
jcl (Chief Engineer, Frankfurt; joined Jul 2000, 27,978 posts)

No, I think you'll need more than 651 samples for upsampling. It is not sufficient to just add the difference. That would balance them globally, but not locally. Samples should be balanced at any point, while still maintaining their order. This is important when you later split them or process them in batches.

I don't know the details, but suppose the algorithm just runs over the samples and duplicates the last positive whenever there are more negatives, and vice versa. You can see that this adds a lot more than 651 samples.
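
A literal reading of that supposition can be sketched in Python as follows. This is only an illustration of the hypothetical scheme described above, not Zorro's actual implementation; the function name is made up, samples are reduced to their objective values, and 0 is assumed to count as negative.

```python
def balance_running(objectives):
    """After each sample, duplicate the most recent sample of the lagging class."""
    out = []
    diff = 0          # negatives minus positives among everything emitted so far
    last_pos = None   # most recent positive sample seen
    last_neg = None   # most recent negative sample seen
    for y in objectives:
        out.append(y)
        if y > 0:
            last_pos = y
            diff -= 1
        else:
            last_neg = y
            diff += 1
        # Restore the running balance by repeating the last opposite-class sample.
        if diff > 0 and last_pos is not None:
            out.append(last_pos)
            diff -= 1
        elif diff < 0 and last_neg is not None:
            out.append(last_neg)
            diff += 1
    return out


# Globally balanced input (4 negatives, 4 positives), yet samples are still added.
data = [-1, -1, -1, -1, 1, 1, 1, 1]
balanced = balance_running(data)
print(len(balanced) - len(data))  # 4
```

The input already has as many positives as negatives, so plain upsampling would add nothing, while the running scheme adds four samples. That is the sense in which balancing locally can add far more than the global difference.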

Re: What algorithm is used for BALANCED? [Re: jcl] #478352
10/08/19 16:23
JamesHH (OP)

I see your point about maintaining the ordering.

However, it does not simply duplicate the last positive (when there are more negatives), or else only 651 samples would have been added above, yes?

It looks like each sample is duplicated a random number of times, both positive and negative. I am guessing that this random number is distributed in such a way that the resulting sequence is approximately balanced (though it seems more exact than this).

I wonder why an algorithm that repeats both positive and negative samples was chosen?

Re: What algorithm is used for BALANCED? [Re: JamesHH] #478355
10/09/19 00:24
JamesHH (OP)

I reread your reply, and now I see that it must be "balanced at every point". I'm not exactly sure how that would be mathematically defined, but it could mean that there is balancing over a sliding window perhaps? But then there probably should be a window size parameter.

Re: What algorithm is used for BALANCED? [Re: JamesHH] #478357
10/09/19 08:11
jcl

"Any point" means that any sub-range of the samples is still balanced. Random numbers or sliding windows are not used, the algorithm is just as simple as described above.

Re: What algorithm is used for BALANCED? [Re: jcl] #478374
10/09/19 17:26
JamesHH (OP)

Originally Posted by jcl
"Any point" means that any sub-range of the samples is still balanced.


But that begs the question of what a "sub-range" is. It can't possibly be balanced on every interval.

Quote
Random numbers or sliding windows are not used, the algorithm is just as simple as described above.


Yes, I reverse-engineered the algorithm (or at least I reproduced BALANCED on two different datasets). At each sample, it looks at the *initial segment* of the time series so far, including duplicates. If there is an imbalance greater than 1 and repeating the current sample would reduce that imbalance, it duplicates the current sample, up to a limit of three duplicates per sample.
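
A rough Python sketch of that description follows. It is one possible reading of the reverse-engineered behaviour, not confirmed Zorro code. The function name, the choice to count the current sample before testing the imbalance, and treating 0 as negative are all assumptions, and samples are reduced to their objective values.

```python
MAX_DUPLICATES = 3  # per-sample cap mentioned in the post


def balance_prefix(objectives, max_duplicates=MAX_DUPLICATES):
    """Repeat the current sample while that shrinks the imbalance of the prefix."""
    out = []
    diff = 0  # negatives minus positives over everything emitted so far
    for y in objectives:
        step = -1 if y > 0 else 1  # effect of one copy of this sample on diff
        out.append(y)
        diff += step
        copies = 0
        # Duplicate only if the prefix imbalance exceeds 1, another copy
        # would reduce it, and the per-sample cap has not been reached.
        while (copies < max_duplicates
               and abs(diff) > 1
               and abs(diff + step) < abs(diff)):
            out.append(y)
            diff += step
            copies += 1
    return out


# Five negatives followed by one positive: the positive is repeated three times.
print(balance_prefix([-1, -1, -1, -1, -1, 1]))
# [-1, -1, -1, -1, -1, 1, 1, 1, 1]
```

Under this reading both classes can be duplicated, depending on which one is lagging in the prefix at that moment, which would fit the observation that far more than 651 samples were added.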

What was surprising is that my standard feed-forward NN was barely able to learn on data that was only slightly imbalanced, something like 52%-48%. Yet it learned so well after this simple balancing algorithm. Everything I have read so far about ML seems to focus on the case of extreme imbalances. Also, I looked up quite a few balancing techniques and none of them was anything like the one in Zorro.

Re: What algorithm is used for BALANCED? [Re: JamesHH] #478380
10/10/19 11:34
jcl

If a NN did not work well with slightly imbalanced data, but worked better with the Zorro balance algorithm, then it probably uses mini-batch gradient descent for learning. Mini-batch requires that not only the whole set, but also small intervals of the samples are still more or less balanced.
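
To make the point about small intervals concrete, here is a minimal Python sketch that measures the share of positives in each consecutive mini-batch. The function name and the batch size are illustrative only, and it assumes the samples are fed in their stored order without shuffling.

```python
def batch_positive_fractions(objectives, batch_size):
    """Fraction of positive objectives in each consecutive mini-batch."""
    fractions = []
    for start in range(0, len(objectives), batch_size):
        batch = objectives[start:start + batch_size]
        positives = sum(1 for y in batch if y > 0)
        fractions.append(positives / len(batch))
    return fractions


# Both series are balanced overall, but only the second is balanced per batch.
print(batch_positive_fractions([-1, -1, -1, -1, 1, 1, 1, 1], 4))  # [0.0, 1.0]
print(batch_positive_fractions([-1, 1, -1, 1, -1, 1, -1, 1], 4))  # [0.5, 0.5]
```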

Re: What algorithm is used for BALANCED? [Re: JamesHH] #478391
10/11/19 17:32
JamesHH (OP)

Yes, it does use mini-batch gradient descent, which I believe is by far the most common NN setup.

However, I was also shuffling before each epoch. So standard random upsampling (in my case, randomly sampling 651 samples from the minority class) is just as effective for balancing both locally and globally. And the OOS results were comparable using either standard upsampling or Zorro's balancing algorithm.
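
For reference, standard random upsampling followed by a per-epoch shuffle can be sketched in Python as below. This is a generic illustration, not the exact training code used here; the function name, the `is_positive` callback, and the fixed seed are assumptions made for the example.

```python
import random


def random_upsample(samples, is_positive, rng):
    """Append random copies of minority-class samples until both classes match."""
    samples = list(samples)
    positives = [s for s in samples if is_positive(s)]
    negatives = [s for s in samples if not is_positive(s)]
    minority, majority = sorted((positives, negatives), key=len)
    shortfall = len(majority) - len(minority)
    if shortfall and not minority:
        raise ValueError("cannot upsample an empty class")
    # Draw with replacement from the minority class.
    extra = [rng.choice(minority) for _ in range(shortfall)]
    return samples + extra


rng = random.Random(0)  # fixed seed only to make the example repeatable
objectives = [-1] * 6 + [1] * 4
balanced = random_upsample(objectives, lambda y: y > 0, rng)

for epoch in range(3):
    rng.shuffle(balanced)  # reshuffle before each epoch
    # ... feed `balanced` to the network in mini-batches here ...
```

With the counts from the first post, the shortfall would be the 651 samples mentioned above.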

The strange thing is that when training with Zorro's algorithm there was very heavy overfitting on the training set, but no sign of overfitting with the usual upsampling method. With unshuffled data, the NN could reverse-engineer Zorro's algorithm just as I did, which is not desirable. I'm not sure what is going on with the shuffled data, but the usual upsampling would seem to be better based on this experiment.

