-
Notifications
You must be signed in to change notification settings - Fork 565
Could the ineffectiveness of black's rollouts be an artifact? #139
Comments
thoughtful |
Very good analysis! The intuition is that a forced even distribution of black/white win implicitly uses a posterior value for the value estimate. Note that the value function in OpenGo is [-1, 1] so you might need to make some changes in your analysis. This might also explain the auto-tuning/zig-zag effects: whenever P(blackwin) < 0.5, the posterior P(position | black_win) is larger than it should be, which could cause white to resign on more games and reverse this trend (i.e., negative feedback). An evaluation test can be done by using player-sensitive correction (which is a multiplication of the current Q value by a constant). This should be easy to do. On the other hand, if you convert it to p_uct, it would be quite nonlinear. (I feel that I was reading a thermodynamics paper :D) |
On the other hand, I am not sure whether this can affect black rollout efficiency. It might be the case that since black's Q is biased, more rollouts do not give the correct estimation of V (essentially V^pi) and thus more rollouts do not help. |
I'm glad you find this analysis interesting! The choice of letters is a bit funny in hindsight :) The negative feedback through resignation is a good point and it likely has a huge impact; I never thought of that. I don't think the correction needs to be player-sensitive; it's as easy as inverting the one-to-one transformation Q->E to get Q = rE / (rE + 1-E). You can't quite do the correction by multiplying a constant, since dQ/dE varies as E (or Q). (With Q and E in [-1,1], the formula becomes Q = (r(1+E) - (1-E))/(r(1+E) + (1-E)).) I think the incorrectness of winrate alone cannot explain the black rollout inefficiency. Black and white's winrates are affected equally by the sampling scheme; when Q is replaced by E, 1-Q is also replaced by 1-E. I chose to use black's winrate only for convenience; the winrate is as incorrect for black as it is for white. The interval [0,1] is simply transformed one-to-one onto [0,1], extended (dE/dQ > 1) near 0, and compressed (dE/dQ < 1) near 1, which can be roughly equivalently thought of making c_puct a function of Q. The only disparity between black and white is that when black is leading the winrate gets compressed (or c_puct gets bigger), and when white is leading the winrate gets extended (or c_puct gets smaller). The higher c_puct could be unfavorable for rollout efficiency (which is only my hypothesis) and lower the strength gap between black with higher rollouts and white with lower rollouts, leading to black failing to convert the favorable position into a win. The ratio of maximum extension and maximum compression is 1/r^2 which is independent of whether I choose Q to lie in [0,1] or [-1,1]. |
ah, ic. Yes, the final trained value function is actually not a proper posterior. So the transformation is nonlinear and thus is Q (or E) dependent. Applying the transformation back would yield the correct answer (rather than using effective p_uct, which stems from local linear approximation). |
The analysis on black rollout efficiency would be dependent on where the optimal p_uct is. It is hard to draw a conclusion here. We might need to correct the value function first in the evaluation and then check whether this still exists. |
Today the results of an independent test of rollout efficiency were published on the LZ forum leela-zero/leela-zero#2236. From the diagrams/graphs, ELF v0 and v1 (but not v2) were included in the tests, using the LZ engine presumably, and all the models (as black) with varied rollouts challenge v0 (as white). It seems that v0 and v1 also enjoyed ~250 Elo gain as black per rollout doubling. The difference between LZ and ELF engines should just be the FPU reduction (makes search a bit narrower) and a higher c_puct (0.8, equivalent to 1.6 in ELF). Not sure what can be concluded yet. |
I submitted applications to three positions (SWE or AI) at Facebook 3 weeks ago (with the email [email protected]) through Facebook Careers and still haven't received any response. Would be nice if someone can ping a recruiter to give me a response. Sorry for the OT. |
From the paper:
I wonder whether this behavior could be related to the way training data is sampled:
Theoretical analysis of the effect of ELF's sampling scheme
Let Q = P( black wins | position ) = the actual black winrate at a particular position
Let E = the trained black winrate at the position with ELF's sampling scheme
Then E can be calculated as follows:
Let S be the sample size, so there are S/2 samples that are won by black and S/2 won by white.
So there are
S/2 x P( position | black wins ) samples
with that position that are won by black, and
S/2 x P( position | white wins ) samples
with that position that are won by white.
The total number of samples with that position is the sum.
Note that
P( position | black wins ) = P( black wins | position ) x P(position) / P(black wins) = P(pos) x Q / pB
P( position | white wins ) = P( white wins | position ) x P(position) / P(white wins) = P(pos) x (1-Q) / (1-pB)
if we denote P(black wins), the overall black winrate, by pB.
We immediately see that the percentage of samples that are won by black among samples with that position is
E = (Q/pB) / (Q/pB + (1-Q)/(1-pB)) = Q / (Q + r(1-Q)) = 1 / (1 + r(1/Q-1))
(where r = pB/pW = pB/(1-pB) = 1/pW - 1),
which can be written in the more symmetric form
1/E - 1 = r(1/Q - 1) or (1/pB - 1) (1/E - 1) = 1/Q - 1.
So the effect this sampling scheme is to transform Q into E by the formula. Note that in the tree search, we always pick the move with the largest Q+U, which is now replaced by E+U. Then the increment Δ(Q+U) is replaced by Δ(E+U) = ΔE+ΔU = (ΔE/ΔQ)ΔQ+ΔU. It's common to face a decision between two moves of close winrates, so we assume that ΔQ is small and thus ΔE/ΔQ can be approximated by dE/dQ, so ΔE+ΔU = (dE/dQ) (ΔQ + ΔU/(dE/dQ)). We see that picking the move based on E+U is the same as picking based on Q + U/(dE/dQ), i.e. replacing Q by E has the same effect as dividing c = c_puct = 1.5 by dE/dQ, and I'll call the result effective c_puct. It is easy to compute
dE/dQ = r / (Q + r(1-Q))^2,
which is monotonically decreasing.
At Q = 0 (white wins), we have dE/dQ = 1/r > 1;
at Q = 1 (black wins), we have dE/dQ = r < 1.
and at the initial position (Q = pB), we have dE/dQ = 1/(4 x pB x pW).
For black to win, its winrate needs to march from Q = pB to Q = 1, and the effective c_puct would then increase from the initial value to c/r. On the other hand, if white is to win, the effective c_puct would decrease from the initial value to cr. As is well known, c_puct balances exploration and exploitation and is crucial to effective scaling of strength with rollouts. Therefore, if a higher c_puct is detrimental and a lower c_puct is beneficial to rollout-strength scaling, then black would be at a disadvantage.
As an example, the released v2 model, # 1290, produces 41.4% self-play games won by black, so pB = 0.414 and r = 0.706. At Q = 0 and Q =1, the effective c_puct differ by 1/r^2 = 2.00, which is quite significant. The penultimate model # 1499 is even more extreme at pB = 0.329.
Suggestions
Retest black/white rollout-strength scaling with E transformed back to Q (though it may be difficult to decide what r should be since it varies across the whole run).
Test with Leela Zero's 256x40 models or Minigo's v15 256x19 models, and see if a similar effect is present.
Test ELF models against Minigo models of similar strength.
Tune c_puct for ELF models as black and as white separately.
Thank you for the release of the datasets, models, and the wonderful paper, and I just want to make it even better!
The text was updated successfully, but these errors were encountered: