Scale up diag of grad_cov by 1.0001 prior to diagonalizing it.

Daniel Povey 2022-08-06 07:06:23 +08:00
parent c021b4fec6
commit 9bbf8ada57


@@ -684,8 +684,15 @@ param_rms_smooth1: Smoothing proportion for parameter matrix, if assumed rank of
 # that takes normally distributed data to P, so we can use
 # C U for any orthogonal U, since C U I U^T C^T == P.
 # So there is no harm in choosing a matrix U that diagonalizes the
-# projected grad_cov. grad_cov gets projected by
+# projected grad_cov.
+grad_cov = grad_cov.clone()
+# Scale up the diagonal of grad_cov; this is to prevent optimizing along directions with low
+# gradient covariance that would suffer excessively from roundoff in the grads, i.e. directions
+# where the projected gradient would be dominated by noise. This is an issue
+# for layers preceding softmax layers, where the direction that is the sum of all the axes
+# has zero gradient; we can observe large magnitudes along this direction when training using
+# this type of method.
+_diag(grad_cov).mul_(1.0001)
 grad_cov_proj = torch.matmul(C.transpose(2, 3),
                              torch.matmul(grad_cov, C))
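
Below is a minimal, self-contained sketch of the idea behind this change, outside the optimizer. It is not the committed code: the `_diag` helper defined here, the unbatched 2-D matrices, the Cholesky factor used for C, and the synthetic grad_cov with a zero-gradient all-ones direction are all assumptions for illustration (the real code works on batched 4-D tensors). It checks that replacing C with C U for an orthogonal U leaves C C^T == P unchanged, and applies the 1.0001 diagonal scaling before projecting and diagonalizing grad_cov.

import torch

def _diag(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for the _diag helper used above (assumed): returns a view of the
    # diagonal of the last two dims, so in-place ops like mul_ modify x itself.
    return x.diagonal(dim1=-2, dim2=-1)

torch.manual_seed(0)
size = 4

# Toy parameter covariance P and a factor C with C C^T == P.
A = torch.randn(size, size)
P = A @ A.t()
C = torch.linalg.cholesky(P)

# For any orthogonal U, (C U)(C U)^T == C U I U^T C^T == P, so we are free to
# pick U to diagonalize the projected gradient covariance.
U_rand, _ = torch.linalg.qr(torch.randn(size, size))
assert torch.allclose((C @ U_rand) @ (C @ U_rand).t(), P, atol=1e-4)

# Toy gradient covariance with a zero-variance direction (the all-ones
# direction, as for inputs to a softmax): remove the per-row mean.
g = torch.randn(1000, size)
g = g - g.mean(dim=1, keepdim=True)
grad_cov = (g.t() @ g) / g.shape[0]

# The change in this commit: scale up the diagonal slightly before projecting
# and diagonalizing, so directions with near-zero gradient covariance
# (dominated by roundoff noise) do not get selected.
grad_cov = grad_cov.clone()
_diag(grad_cov).mul_(1.0001)

grad_cov_proj = C.t() @ grad_cov @ C
eigvals, U = torch.linalg.eigh(grad_cov_proj)   # U diagonalizes grad_cov_proj
C_new = C @ U
assert torch.allclose(C_new @ C_new.t(), P, atol=1e-4)   # P is unchanged

The scaling is tiny (0.01% of each diagonal entry), so it barely perturbs directions with real gradient covariance, while keeping essentially-zero-covariance directions from dominating the diagonalization through roundoff.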