178 Commits

Author  SHA1  Message  Date
Daniel Povey  5632782ee1  Merge branch 'scaled_adam_exp539' into scaled_adam_exp548  2022-11-29 15:40:23 +08:00
Daniel Povey  b90d8aabde  Revert the alternate-layers-only change for nonlin_attention and attention_squeeze  2022-11-29 15:38:55 +08:00
Daniel Povey  753269668a  Change ratio in NonlinAttentionModule from 8 to 2  2022-11-29 15:38:13 +08:00
Daniel Povey  93942725c4  Increase min_abs of balancer of encoder layer from 0.2 to 0.4.  2022-11-29 13:46:47 +08:00
Daniel Povey  36a2f33a6f  Have value dim in NonlinAttentionModule be half of num_channels  2022-11-28 21:55:06 +08:00
Daniel Povey  258d4f1353  Let ratio be 8, not 2, for sigmoid in NonlinAttentionModule  2022-11-28 21:51:29 +08:00
Daniel Povey  7018c722b5  Let ratio of values to sigmoids be 8, not 2  2022-11-28 21:50:11 +08:00
Daniel Povey  643c547eec  Double just the value dim in NonlinAttentionLayer.  2022-11-28 20:56:47 +08:00
Daniel Povey  88bc45d596  Halve scale on aux_loss  2022-11-28 16:37:46 +08:00
Daniel Povey  cee62c823d  Have final prob of aux_loss for input projections be 0  2022-11-28 16:36:17 +08:00
Daniel Povey  9cf5d92f39  Have nonlin_attention and attention_squeeze operate only on every other layer.  2022-11-28 16:24:24 +08:00
Daniel Povey  f483f1e0ef  Implement attention weights sharing for successive layers, for Zipformer  2022-11-28 13:41:11 +08:00
Daniel Povey  121f7e2a45  Documentation fix.  2022-11-28 12:10:08 +08:00
Daniel Povey  c6d859dd05  Increase min_abs of balancer in NonlinAttentionModule from 1.5 to 2.0.  2022-11-28 11:35:00 +08:00
Daniel Povey  39ce60bb7c  Decrease final value of max_abs in AttentionSqueeze from 5.0 to 1.0  2022-11-28 10:45:53 +08:00
Daniel Povey  a3b07fd098  Double aux_grad scale  2022-11-28 00:19:03 +08:00
Daniel Povey  9752778ee6  Use the same schedule for in_proj as out_proj. Only affects a couple of modules.  2022-11-28 00:09:26 +08:00
Daniel Povey  0307252832  Bug fix  2022-11-27 21:33:37 +08:00
Daniel Povey  5128ff8797  Changes to balancer min_abs/max_abs limits.  2022-11-27 21:14:41 +08:00
Daniel Povey  785a524341  Increase min_abs of hidden balancer of ff modules from 0.2 to 1.0  2022-11-27 17:06:31 +08:00
Daniel Povey  2f4df1278d  Have aux_grad_scales for input terminate after 1k batches; double the scale on aux_grad.  2022-11-27 13:56:50 +08:00
Daniel Povey  a6fb9772a8  Remove 4 layers.  2022-11-27 13:29:29 +08:00
Daniel Povey  2e0111e6ef  Halve aux_grad_scale  2022-11-26 23:36:00 +08:00
Daniel Povey  c91014f104  Changes to balancer schedules: start max_abs from 5.0 not 4.0, start min_positive from 0.1 more consistently; finish at 8k not 12k.  2022-11-26 23:10:18 +08:00
Daniel Povey  633b6785f1  Halve final scale of aux_grad, and make schedule decrease more slowly.  2022-11-26 22:27:20 +08:00
Daniel Povey  4874ded2e9  Introduce balancer schedules for the DoubleSwish() in feedforward and conv modules  2022-11-26 20:20:20 +08:00
Daniel Povey  9ce99b150d  Remove one attention_squeeze module; halve dimension in NonlinAttention module; put schedule on balancer of ConvolutionModule  2022-11-26 19:42:33 +08:00
Daniel Povey  a96b92fb54  Make alpha for LinearWithAuxLossFunction be in log space; simplify/rework NonlinAttentionModule so its setup is more like ConvModule.  2022-11-26 19:38:29 +08:00
Daniel Povey  e19118a966  Merge branch 'scaled_adam_exp503' into scaled_adam_exp505  2022-11-26 19:29:58 +08:00
Daniel Povey  faed28ba6a  Changes for debugging/stats.  2022-11-26 18:59:15 +08:00
Daniel Povey  8858fb38f1  Halve expected value of aux_grad scale, and implement it more efficiently, via a scale on the prob of using it.  2022-11-26 14:52:59 +08:00
Daniel Povey  5f80807027  Add LinearWithAuxLoss in nonlin_attention and AttentionSqueeze modules.  2022-11-26 14:15:09 +08:00
Daniel Povey  4058d56c0d  Remove squeeze_excite from Conv2dSubsampling.  2022-11-26 14:04:41 +08:00
Daniel Povey  281b54e7bf  Use LinearWithAuxLoss in more places.  2022-11-26 12:25:22 +08:00
Daniel Povey  d9c7e4f216  Make the in_proj of feedforward modules also be a LinearWithAuxLoss.  2022-11-26 12:13:31 +08:00
Daniel Povey  029f5869c4  Increase schedule init from 0.1 to 0.2  2022-11-25 18:06:13 +08:00
Daniel Povey  2368968114  Make out_proj of feedforward modules be a LinearWithAuxLoss, with nonzero final value at 0.01.  2022-11-25 18:00:46 +08:00
Daniel Povey  8f1ef60951  Integrate LinearWithAuxLoss into SqueezeExcite1d  2022-11-25 16:24:28 +08:00
Daniel Povey  6a91f343e9  Use LinearWithAuxLoss in squeeze-attention module  2022-11-25 16:04:51 +08:00
Daniel Povey  ba348169bf  Change sigmoid of NonlinAttention, for diagnostic purposes.  2022-11-25 12:39:16 +08:00
Daniel Povey  0614f65428  Bug fix, remove 2nd activation in a row  2022-11-24 17:20:28 +08:00
Daniel Povey  534eca4bf3  Add 1d squeeze-and-excite(-like) module in Conv2dSubsampling  2022-11-24 16:18:40 +08:00
Daniel Povey  dd3826104e  Start whitening schedules for the activation in NonlinAttentionModule and AttentionSqueezeModule at lower values; increase some whitening probs.  2022-11-24 15:25:59 +08:00
Daniel Povey  0ac26f4234  Increase initial whitening target for self_attn from 2.0 to 3.0.  2022-11-24 15:18:28 +08:00
Daniel Povey  45069175d9  Add a second whitening to the NonlinAttentionModule, after the aggregation.  2022-11-24 14:16:13 +08:00
Daniel Povey  35f0ea0015  Changes to whitening modules for memory efficiency, moving them inside; increase their prob.  2022-11-24 13:47:22 +08:00
Daniel Povey  de73e2e424  Move whitening of NonlinAttentionModule from the output to the interior; apply it just to the value.  2022-11-24 13:27:32 +08:00
Daniel Povey  ee61ec63b3  Introduce schedules for whitening.  2022-11-23 19:49:34 +08:00
Daniel Povey  a6657e6b40  Harmonize whitening modules, adding them to 3 submodules, changing the configuration of 2 others, and changing the location in NonlinAttention.  2022-11-23 19:08:19 +08:00
Daniel Povey  9ceb41acb4  Remove balancer from SelfAttention module.  2022-11-23 18:41:36 +08:00