[Experiment] Sensitivity to mislabeled result data.

Inspiration
-----
Training examples have a label for the result of the game (z), which we use to train the value head. We set the label to -1 or 1, representing whether the example is from a game that black or white won. We know some percentage of our labels are "wrong" (we disagree on precisely what "wrong" means, but during v9 we know some games were wrong by both of our definitions).

I set out to measure what effect increasing the number of mislabeled values (z) would have on training.

Experiment
---
* Load 10 "golden chunks" (v9 chunks 250 to 259). This is 20 million examples taken from ~600,000 selfplay games from strong v9 models.
* When training, flip the result (from white win to black win and vice versa) in some percentage {1, 2, 5, 10} of examples.
* Train really fast on a bunch of TPUs at a couple of different network sizes.
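The label flipping step above can be sketched roughly as follows. This is a minimal illustration, not the minigo implementation: the function name `fumble_results` and its parameters are hypothetical, and it assumes the results are available as a plain list of -1/1 values rather than as records in the training pipeline.

```python
import random


def fumble_results(results, fumble_prob, seed=None):
    """Return a copy of `results` with each label negated with probability `fumble_prob`.

    Hypothetical helper for illustration only; assumes labels are -1 or 1.
    """
    rng = random.Random(seed)
    return [-z if rng.random() < fumble_prob else z for z in results]


# Toy data standing in for the z labels of 10,000 training examples.
results = [1, -1, 1, 1, -1] * 2000
fumbled = fumble_results(results, 0.05, seed=0)

flipped_fraction = sum(a != b for a, b in zip(results, fumbled)) / len(results)
print(f"flipped {flipped_fraction:.1%} of labels")
```

Each example is flipped independently, so the realized flip rate only approximates `fumble_prob` on a finite sample.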

Results
---
TL;DR: Flipping 1% and 2% of results doesn't have much impact; 5% and 10% have a much bigger impact on confidence and value accuracy.

The key takeaway:
* The AG paper says they aim for 5% bad resigns in their self-play games. This is a trade-off between playing more games and having slightly better labelled games.
    * My experiments show that decreasing this to 4%, 2%, or 1% would speed up training.
    * That said, there are diminishing returns (in value loss) and increasing costs (in computation) to driving this down, given that it might increase average game length by 20% or more. It would also change the corpus of training data, which has an unknown effect on value.

Potential follow ups:
---
* Train with z set to `1 - bad_resign_rate`.
* Many moves in resign-disabled games have value = 1.0 or -1.0, and the MCTS playouts mirror the policy strongly. Maybe we should avoid sampling from these positions.


Data
---
![image](https://user-images.githubusercontent.com/10172976/46502791-d892cf80-c7dd-11e8-9f89-4e804837cabe.png)

![image](https://user-images.githubusercontent.com/10172976/46502811-e5172800-c7dd-11e8-9e75-e4f8aa4b8ea8.png)

![image](https://user-images.githubusercontent.com/10172976/46502821-e9434580-c7dd-11e8-9ee7-0bf8bd2e4270.png)

Holdout
![image](https://user-images.githubusercontent.com/10172976/46502833-f102ea00-c7dd-11e8-9000-a726eccb27a7.png)

Filling own eye when way ahead
https://cloudygo.com/v9-19x19/000000-unused/full/1532637753-tpu-player-deployment-57c689f568-q26qm-29.sgf?M=450
![image](https://user-images.githubusercontent.com/10172976/46503941-4f7d9780-c7e1-11e8-9513-7dd6bca2ed71.png)

Unexpected outcomes versus generation
![image](https://user-images.githubusercontent.com/10172976/46504241-2d384980-c7e2-11e8-9d99-aff38db90d1a.png)

What is z
---
*z* represents "goodness of position for black". We often assume that it's linear and actually represents the approximate chance of winning.

* Andrew takes a "wrong" z to mean the engine result was changed because of something outside of its control, and really only counts this case.
  * In v9 we limited games to ~500 moves. If the clearly winning side passes often and fails to clean up dead groups, it might run out of moves with a "dead" group still on the board, which changes the outcome according to Tromp-Taylor.

* Seth judges a label by asking "if two strong players (humans or bots) played the game from this point a hundred times, would their results agree with our result more than half the time?" If not, the label is "wrong".
  * This means that lots of games are "wrong" (i.e. we are teaching the NN something that it will later need to unlearn).
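Seth's definition can be written as a simple majority check. The sketch below is only illustrative: `label_is_wrong` is a hypothetical name, and it assumes we somehow have the results of replaying the position (e.g. a hundred games between strong players), which this experiment did not collect.

```python
def label_is_wrong(label, replay_results):
    """Return True if replays agree with `label` at most half the time.

    `label` and every entry of `replay_results` are -1 or 1.
    Hypothetical helper illustrating the definition, not code from minigo.
    """
    agreeing = sum(1 for result in replay_results if result == label)
    # "Right" requires agreement in more than half of the replays.
    return 2 * agreeing <= len(replay_results)


# A game labelled 1 whose position is replayed 100 times.
mostly_agree = [1] * 80 + [-1] * 20
mostly_disagree = [1] * 30 + [-1] * 70
print(label_is_wrong(1, mostly_agree))     # False
print(label_is_wrong(1, mostly_disagree))  # True
```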

Helper script
---

```
sethtroisi@sethtroisi:~/minigo$ cat fumble_analysis
WORK_DIR="gs://$USER-sandbox/model"
DATA_DIR="gs://v9-19/data/golden_chunks"

# Usage: test <trunk_layers> <conv_width> <game_result_fumble_prob>
test () {
  BOARD_SIZE=19 python dual_net.py train --use_tpu --tpu_name=sethtroisi --model_dir=$WORK_DIR/fumble_analysis/$1_$2_$3 $DATA_DIR/{250..259}.tfrecord.zz --steps=30720 --iterations_per_loop=128 --summary_steps=256 --trunk_layers=$1 --conv_width=$2 --game_result_fumble_prob=$3
}

test 5 64 0
test 5 64 0.10
test 5 64 0.05
test 5 64 0.02
test 5 64 0.01

test 10 128 0
test 10 128 0.10
test 10 128 0.05
test 10 128 0.02
test 10 128 0.01

test 20 256 0
test 20 256 0.10
test 20 256 0.05
test 20 256 0.02
test 20 256 0.01

test 5  128 0
test 15 128 0
test 20 128 0
test 5  192 0
test 10 192 0
test 15 192 0
test 20 192 0
```
