Daniel Povey
7018c722b5
Let ratio of values to sigmoids be 8, not 2
2022-11-28 21:50:11 +08:00
Daniel Povey
643c547eec
Double just the value dim in NonlinAttentionLayer.
2022-11-28 20:56:47 +08:00
Daniel Povey
88bc45d596
Halve scale on aux_loss
2022-11-28 16:37:46 +08:00
Daniel Povey
cee62c823d
Have final prob of aux_loss for input projections be 0
2022-11-28 16:36:17 +08:00
Daniel Povey
9cf5d92f39
Have nonlin_attention and attention_squeeze operate only on every other layer.
2022-11-28 16:24:24 +08:00
Daniel Povey
f483f1e0ef
Implement attention weights sharing for successive layers, for Zipformer
2022-11-28 13:41:11 +08:00
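(For orientation: the weight sharing in the commit above, where successive layers reuse one set of attention weights, can be sketched roughly as below. This is an illustrative PyTorch sketch only, not the Zipformer code; the class and method names are made up for the example.)

    import torch
    import torch.nn as nn

    class SharedAttnPair(nn.Module):
        """Illustrative pair of layers in which the second layer reuses the
        attention weights computed by the first (hypothetical, not icefall code)."""
        def __init__(self, embed_dim: int, num_heads: int):
            super().__init__()
            self.num_heads = num_heads
            self.qk_proj = nn.Linear(embed_dim, 2 * embed_dim)  # queries/keys, computed once
            self.v_proj1 = nn.Linear(embed_dim, embed_dim)      # values for layer 1
            self.v_proj2 = nn.Linear(embed_dim, embed_dim)      # values for layer 2
            self.out1 = nn.Linear(embed_dim, embed_dim)
            self.out2 = nn.Linear(embed_dim, embed_dim)

        def attn_weights(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, time, embed_dim) -> weights: (batch, heads, time, time)
            b, t, d = x.shape
            q, k = self.qk_proj(x).chunk(2, dim=-1)
            q = q.view(b, t, self.num_heads, -1).transpose(1, 2)
            k = k.view(b, t, self.num_heads, -1).transpose(1, 2)
            return (q @ k.transpose(-2, -1) / (d // self.num_heads) ** 0.5).softmax(dim=-1)

        def apply_attn(self, x, weights, v_proj, out_proj):
            b, t, d = x.shape
            v = v_proj(x).view(b, t, self.num_heads, -1).transpose(1, 2)
            y = (weights @ v).transpose(1, 2).reshape(b, t, d)
            return out_proj(y)

        def forward(self, x):
            w = self.attn_weights(x)                                 # computed once...
            x = x + self.apply_attn(x, w, self.v_proj1, self.out1)
            x = x + self.apply_attn(x, w, self.v_proj2, self.out2)  # ...reused here
            return x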
Daniel Povey
121f7e2a45
Documentation fix.
2022-11-28 12:10:08 +08:00
Daniel Povey
c6d859dd05
Increase min_abs of balancer in NonlinAttentionModule from 1.5 to 2.0.
2022-11-28 11:35:00 +08:00
Daniel Povey
39ce60bb7c
Decrease final value of max_abs in AttentionSqueeze from 5.0 to 1.0
2022-11-28 10:45:53 +08:00
Daniel Povey
a3b07fd098
Double aux_grad scale
2022-11-28 00:19:03 +08:00
Daniel Povey
9752778ee6
Use the same schedule for in_proj as out_proj. Only affects a couple of modules.
2022-11-28 00:09:26 +08:00
Daniel Povey
0307252832
Bug fix
2022-11-27 21:33:37 +08:00
Daniel Povey
5128ff8797
Changes to balancer min_abs/max_abs limits.
2022-11-27 21:14:41 +08:00
Daniel Povey
785a524341
Increase min_abs of hidden balancer of ff modules from 0.2 to 1.0
2022-11-27 17:06:31 +08:00
Daniel Povey
2f4df1278d
Have aux_grad_scales for input terminate after 1k batches; double the scale on aux_grad.
2022-11-27 13:56:50 +08:00
Daniel Povey
a6fb9772a8
Remove 4 layers.
2022-11-27 13:29:29 +08:00
Daniel Povey
2e0111e6ef
Halve aux_grad_scale
2022-11-26 23:36:00 +08:00
Daniel Povey
c91014f104
Changes to balancer schedules: start max_abs from 5.0 not 4.0, start min_positive from 0.1 more consistently; finish at 8k not 12k.
2022-11-26 23:10:18 +08:00
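(Several commits in this stretch tune balancer "schedules", e.g. max_abs starting at 5.0 and the schedule finishing at 8k rather than 12k batches. A schedule in this sense is just a value that is a piecewise-linear function of the batch count; the sketch below is a generic illustration under that assumption, not the actual icefall scheduling class.)

    def piecewise_linear(batch_count: int, points: list[tuple[int, float]]) -> float:
        """Piecewise-linear schedule (illustrative): `points` is a sorted list of
        (batch, value) pairs; before the first point we return its value,
        after the last point we return that last value."""
        b0, v0 = points[0]
        if batch_count <= b0:
            return v0
        for b1, v1 in points[1:]:
            if batch_count <= b1:
                # linear interpolation between (b0, v0) and (b1, v1)
                frac = (batch_count - b0) / (b1 - b0)
                return v0 + frac * (v1 - v0)
            b0, v0 = b1, v1
        return v0

    # e.g. a max_abs limit that starts at 5.0 and reaches 1.0 by batch 8000:
    max_abs = piecewise_linear(batch_count=3000, points=[(0, 5.0), (8000, 1.0)])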
Daniel Povey
633b6785f1
Halve final scale of aux_grad, and make schedule decrease more slowly.
2022-11-26 22:27:20 +08:00
Daniel Povey
4874ded2e9
Introduce balancer schedules for the DoubleSwish() in feedforward and conv modules
2022-11-26 20:20:20 +08:00
Daniel Povey
9ce99b150d
Remove one attention_squeeze module; halve dimension in NonlinAttention module; put schedule on balancer of ConvolutionModule
2022-11-26 19:42:33 +08:00
Daniel Povey
a96b92fb54
Make alpha for LinearWithAuxLossFunction be in log space; simplify/rework NonlinAttentionModule, setup more like ConvModule now.
2022-11-26 19:38:29 +08:00
Daniel Povey
e19118a966
Merge branch 'scaled_adam_exp503' into scaled_adam_exp505
2022-11-26 19:29:58 +08:00
Daniel Povey
faed28ba6a
Changes for debugging/stats.
2022-11-26 18:59:15 +08:00
Daniel Povey
8858fb38f1
Halve expected value of aux_grad scale, and implement it more efficiently, via a scale on the prob of using it.
2022-11-26 14:52:59 +08:00
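(The "scale on the prob of using it" in the commit above can be read as: instead of always adding the auxiliary term with a small scale, add it only with some probability, so its expected contribution is prob x scale while most batches skip the extra work. A hedged, generic sketch of that idea, not the actual LinearWithAuxLoss code:)

    import random
    import torch

    def maybe_add_aux_loss(main_loss: torch.Tensor,
                           aux_loss: torch.Tensor,
                           scale: float = 0.04,
                           apply_prob: float = 0.25) -> torch.Tensor:
        """Illustrative sketch: stochastically add an auxiliary loss so that its
        expected contribution is apply_prob * scale * aux_loss, while most
        batches skip it entirely (cheaper than always applying a tiny scale)."""
        if random.random() < apply_prob:
            return main_loss + scale * aux_loss
        return main_loss

    # Halving apply_prob halves the expected aux contribution without touching `scale`.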
Daniel Povey
5f80807027
Add LinearWithAuxLoss in nonlin_attention and AttentionSqueeze modules.
2022-11-26 14:15:09 +08:00
Daniel Povey
4058d56c0d
Remove squeeze_excite from Conv2dSubsampling.
2022-11-26 14:04:41 +08:00
Daniel Povey
281b54e7bf
Use LinearWithAuxLoss in more places.
2022-11-26 12:25:22 +08:00
Daniel Povey
d9c7e4f216
Make the in_proj of feedforward modules also be a LinearWithAuxLoss.
2022-11-26 12:13:31 +08:00
Daniel Povey
029f5869c4
Increase schedule init from 0.1 to 0.2
2022-11-25 18:06:13 +08:00
Daniel Povey
2368968114
Make out_proj of feedforward modules be a LinearWithAuxLoss, with nonzero final value at 0.01.
2022-11-25 18:00:46 +08:00
Daniel Povey
8f1ef60951
Integrate LinearWithAuxLoss into SqueezeExcite1d
2022-11-25 16:24:28 +08:00
Daniel Povey
6a91f343e9
Use LinearWithAuxLoss in squeeze-attention module
2022-11-25 16:04:51 +08:00
Daniel Povey
ba348169bf
Change to the sigmoid of NonlinAttention, for diagnostic purposes.
2022-11-25 12:39:16 +08:00
Daniel Povey
0614f65428
Bug fix: remove 2nd activation in a row
2022-11-24 17:20:28 +08:00
Daniel Povey
534eca4bf3
Add 1d squeeze and excite (-like) module in Conv2dSubsampling
2022-11-24 16:18:40 +08:00
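(A 1-D squeeze-and-excite(-like) block in this sense typically pools over the time axis, passes the pooled vector through a small bottleneck, and gates the channels with a sigmoid. The sketch below is a minimal illustration of that pattern; it is not the icefall SqueezeExcite1d implementation.)

    import torch
    import torch.nn as nn

    class SqueezeExcite1d(nn.Module):
        """Illustrative 1-D squeeze-and-excite: gate channels using statistics
        pooled over the time axis (hypothetical sketch, not icefall code)."""
        def __init__(self, channels: int, bottleneck: int):
            super().__init__()
            self.fc1 = nn.Linear(channels, bottleneck)
            self.fc2 = nn.Linear(bottleneck, channels)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, time, channels)
            s = x.mean(dim=1)                                      # "squeeze": average over time
            g = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))  # "excite": per-channel gates
            return x * g.unsqueeze(1)                              # rescale each channel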
Daniel Povey
dd3826104e
Start whitening schedules for activation in NonlinAttentionModule, AttentionSqueezeModule lower; increase some whitening probs.
2022-11-24 15:25:59 +08:00
Daniel Povey
0ac26f4234
Increase initial whitening target for self_attn from 2.0 to 3.0.
2022-11-24 15:18:28 +08:00
Daniel Povey
45069175d9
Add a second whitening to the NonlinAttentionModule, after the aggregation.
2022-11-24 14:16:13 +08:00
Daniel Povey
35f0ea0015
Changes to whitening modules for memory efficiency, moving them inside; increase their prob.
2022-11-24 13:47:22 +08:00
Daniel Povey
de73e2e424
Move whitening of NonlinAttentionModule from the output to the interior, so it applies just to the value.
2022-11-24 13:27:32 +08:00
Daniel Povey
ee61ec63b3
Introduce schedules for whitening.
2022-11-23 19:49:34 +08:00
Daniel Povey
a6657e6b40
Harmonize whitening modules: add them to 3 submodules, change the configuration on 2 others, and change the location in NonlinAttention.
2022-11-23 19:08:19 +08:00
Daniel Povey
9ceb41acb4
Remove balancer from SelfAttention module.
2022-11-23 18:41:36 +08:00
Daniel Povey
f2dbf87461
Remove invocation of out_balancer
2022-11-23 18:40:27 +08:00
Daniel Povey
b88f12fe83
Remove out_balancer of NonlinAttentionModule
2022-11-23 18:37:45 +08:00
Daniel Povey
9138695dfe
Fix bug regarding attn_weights
2022-11-23 17:04:17 +08:00
Daniel Povey
36e49a8d61
Change for memory efficiency
2022-11-23 15:38:34 +08:00
Daniel Povey
1d0252d420
Merge branch 'scaled_adam_exp466' into scaled_adam_exp472.
...
Below is a more complete list of the changes I am making, although some of
these may be counted in the last
The numbers XXX below correspond to branches numbered scaled_adam_expXXX.
- from 412/413 (cherry-picked): dropout for attention in attention_squeeze and nonlin_attention modules,
  but simplified this a little to use the same dropout schedule and drop them out all together;
  also have all 3 submodules use separate heads.
- from 460->461, which is in the history of 464: revert the part about balancing the output of the attention_squeeze module.
- merge from 462->467, about using TanSwish instead of tanh.
- merge 462->465: remove whitening in the self-attention module.
- merge the part of 465->466 that was about diagnostics (name in Whiten module).
2022-11-23 14:41:09 +08:00
Daniel Povey
f89a85aed8
Merge branch 'scaled_adam_exp465' into scaled_adam_exp472
2022-11-23 14:16:17 +08:00