I found that tot_score sometimes contains inf values, which prevents the model from converging. Adding an inf mask makes training more stable.
Also, grad_scale is too small.
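
A minimal sketch of the inf mask idea, not the exact change in this patch. It assumes tot_score is a 1-D PyTorch tensor of per-utterance scores and that the loss is the negated sum of those scores. The helper name masked_loss is hypothetical.

```python
from typing import Tuple

import torch


def masked_loss(tot_score: torch.Tensor) -> Tuple[torch.Tensor, int]:
    """Hypothetical helper: sum only the finite scores and negate the result.

    Assumes tot_score is 1-D with one score per utterance.
    """
    # True where the score is neither inf, -inf, nor nan.
    finite_mask = torch.isfinite(tot_score)
    # Boolean indexing drops the non-finite entries before the reduction,
    # so they contribute neither to the loss nor to the gradient.
    loss = -tot_score[finite_mask].sum()
    num_dropped = int((~finite_mask).sum().item())
    return loss, num_dropped


# Toy input (made up for illustration): the second score is -inf.
tot_score = torch.tensor([-1.5, float("-inf"), -2.5], requires_grad=True)
loss, num_dropped = masked_loss(tot_score)
loss.backward()
```

Indexing with the mask, rather than multiplying the scores by it, avoids inf * 0, which evaluates to nan. Logging num_dropped per batch is one way to see how often the mask is actually triggered.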