Fangjun Kuang 855c76655b
Add zipformer from Dan using multi-dataset setup (#675)
* Bug fix

* Change subsampling factor from 1 to 2

* Implement AttentionCombine as replacement for RandomCombine

* Decrease random_prob from 0.5 to 0.333

* Add print statement

* Apply single_prob mask, so sometimes we just get one layer as output.

* Introduce feature mask per frame

* Include changes from Liyong about padding conformer module.

* Reduce single_prob from 0.5 to 0.25

* Reduce feature_mask_dropout_prob from 0.25 to 0.15.

* Remove dropout from inside ConformerEncoderLayer, for adding to residuals

* Increase feature_mask_dropout_prob from 0.15 to 0.2.

* Swap random_prob and single_prob, to reduce prob of being randomized.

* Decrease feature_mask_dropout_prob back from 0.2 to 0.15, i.e. revert the 43->48 change.

* Randomize order of some modules

* Bug fix

* Stop backprop bug

* Introduce a scale dependent on the masking value

* Implement efficient layer dropout

* Simplify the learned scaling factor on the modules

* Compute valid loss on batch 0.

* Make the scaling factors more global and the randomness of dropout more random

* Bug fix

* Introduce offset in layerdrop_scaleS

* Remove final combination; implement layer drop that drops the final layers.

* Bug fixes

* Fix bug RE self.training

* Fix bug setting layerdrop mask

* Fix eigs call

* Add debug info

* Remove warmup

* Remove layer dropout and model-level warmup

* Don't always apply the frame mask

* Slight code cleanup/simplification

* Various fixes, finish implementing frame masking

* Remove debug info

* Don't compute validation if printing diagnostics.

* Apply layer bypass during warmup in a new way, including 2s and 4s of layers.

* Update checkpoint.py to deal with int params

* Revert initial_scale to previous values.

* Remove the feature where it was bypassing groups of layers.

* Implement layer dropout with probability 0.075

* Fix issue with warmup in test time

* Add warmup schedule where dropout disappears from earlier layers first.

* Have warmup that gradually removes dropout from layers; multiply initialization scales by 0.1.

* Do dropout a different way

* Fix bug in warmup

* Remove debug print

* Make the warmup mask per frame.

* Implement layer dropout (in a relatively efficient way)

* Decrease initial keep_prob to 0.25.

* Make it start warming up from the very start, and increase warmup_batches to 6k

* Change warmup schedule and increase warmup_batches from 4k to 6k

* Make the bypass scale trainable.

* Change the initial keep-prob back from 0.25 to 0.5

* Bug fix

* Limit bypass scale to >= 0.1

* Revert "Change warmup schedule and increase warmup_batches from 4k to 6k"

This reverts commit 86845bd5d859ceb6f83cd83f3719c3e6641de987.

* Do warmup by dropping out whole layers.

* Decrease frequency of logging variance_proportion

* Make layerdrop different in different processes.

* For speed, drop the same num layers per job.

* Decrease initial_layerdrop_prob from 0.75 to 0.5

* Revert also the changes in scaled_adam_exp85 regarding warmup schedule

* Remove unused code LearnedScale.

* Reintroduce batching to the optimizer

* Various fixes from debugging with nvtx, but removed the NVTX annotations.

* Only apply ActivationBalancer with prob 0.25.

* Fix s -> scaling for import.

* Increase final layerdrop prob from 0.05 to 0.075

* Fix bug where fewer layers were dropped than should be; remove unnecessary print statement.

* Fix bug in choosing layers to drop

* Refactor RelPosMultiheadAttention to have 2nd forward function and introduce more modules in conformer encoder layer

* Reduce final layerdrop_prob from 0.075 to 0.05.

* Fix issue with diagnostics if stats is None

* Remove persistent attention scores.

* Make ActivationBalancer and MaxEig more efficient.

* Cosmetic improvements

* Change scale_factor_scale from 0.5 to 0.8

* Make the ActivationBalancer regress to the data mean, not zero, when enforcing abs constraint.

* Remove unused config value

* Fix bug when channel_dim < 0

* Fix bug when channel_dim < 0

* Simplify how the positional-embedding scores work in attention (thanks to Zengwei for this concept)

* Revert dropout on attention scores to 0.0.

* This should just be a cosmetic change, regularizing how we get the warmup times from the layers.

* Reduce beta from 0.75 to 0.0.

* Reduce stats period from 10 to 4.

* Reworking of ActivationBalancer code to hopefully balance speed and effectiveness.

* Add debug code for attention weights and eigs

* Remove debug statement

* Add different debug info.

* Penalize attention-weight entropies above a limit.

* Remove debug statements

* use larger delta but only penalize if small grad norm

* Bug fixes; change debug freq

* Change cutoff for small_grad_norm

* Implement whitening of values in conformer.

* Also whiten the keys in conformer.

* Fix an issue with scaling of grad.

* Decrease whitening limit from 2.0 to 1.1.

* Fix debug stats.

* Reorganize Whiten() code; configs are not the same as before.  Also remove MaxEig for self_attn module

* Bug fix RE float16

* Revert whitening_limit from 1.1 to 2.2.

* Replace MaxEig with Whiten with limit=5.0, and move it to end of ConformerEncoderLayer

* Change LR schedule to start off higher

* Simplify the dropout mask, no non-dropped-out sequences

* Make attention dims configurable, not embed_dim//2, trying 256.

* Reduce attention_dim to 192; cherry-pick scaled_adam_exp130 which is linear_pos interacting with query

* Use half the dim for values, vs. keys and queries.

* Increase initial-lr from 0.04 to 0.05, plus changes for diagnostics

* Cosmetic changes

* Changes to avoid bug in backward hooks, affecting diagnostics.

* Random clip attention scores to -5..5.

* Add some random clamping in model.py

* Add reflect=0.1 to invocations of random_clamp()

* Remove in_balancer.

* Revert model.py so there are no constraints on the output.

* Implement randomized backprop for softmax.

* Reduce min_abs from 1e-03 to 1e-04

* Add RandomGrad with min_abs=1.0e-04

* Use full precision to do softmax and store ans.

* Fix bug in backprop of random_clamp()

* Get the randomized backprop for softmax in autocast mode working.

* Remove debug print

* Reduce min_abs from 1.0e-04 to 5.0e-06

* Add hard limit of attention weights to +- 50

* Use normal implementation of softmax.

* Remove use of RandomGrad

* Remove the use of random_clamp in conformer.py.

* Reduce the limit on attention weights from 50 to 25.

* Reduce min_prob of ActivationBalancer from 0.1 to 0.05.

* Penalize too large weights in softmax of AttentionDownsample()

* Also apply limit on logit in SimpleCombiner

* Increase limit on logit for SimpleCombiner to 25.0

* Add more diagnostics to debug gradient scale problems

* Changes to grad scale logging; increase grad scale more frequently if less than one.

* Add logging

* Remove comparison diagnostics, which were not that useful.

* Configuration changes: scores limit 5->10, min_prob 0.05->0.1, cur_grad_scale more aggressive increase

* Reset optimizer state when we change loss function definition.

* Make warmup period decrease scale on simple loss, leaving pruned loss scale constant.

* Cosmetic change

* Increase initial-lr from 0.05 to 0.06.

* Increase initial-lr from 0.06 to 0.075 and decrease lr-epochs from 3.5 to 3.

* Fixes to logging statements.

* Introduce warmup schedule in optimizer

* Increase grad_scale to Whiten module

* Add inf check hooks

* Renaming in optim.py; remove step() from scan_pessimistic_batches_for_oom in train.py

* Change base lr to 0.1, also rename from initial lr in train.py

* Adding activation balancers after simple_am_prob and simple_lm_prob

* Reduce max_abs on am_balancer

* Increase max_factor in final lm_balancer and am_balancer

* Use penalize_abs_values_gt, not ActivationBalancer.

* Trying to reduce grad_scale of Whiten() from 0.02 to 0.01.

* Add hooks.py, had neglected to git add it.

* Don't do penalize_values_gt on simple_lm_proj and simple_am_proj; reduce --base-lr from 0.1 to 0.075

* Increase probs of activation balancer and make it decay slower.

* Don't print out full non-finite tensor

* Increase default max_factor for ActivationBalancer from 0.02 to 0.04; decrease max_abs in ConvolutionModule.deriv_balancer2 from 100.0 to 20.0

* reduce initial scale in GradScaler

* Increase max_abs in ActivationBalancer of conv module from 20 to 50

* --base-lr 0.075->0.5; --lr-epochs 3->3.5

* Revert 179->180 change, i.e. change max_abs for deriv_balancer2 back from 50.0 to 20.0

* Save some memory in the autograd of DoubleSwish.

* Change the discretization of the sigmoid to be expectation preserving.

* Fix randn to rand

* Try a more exact way to round to uint8 that should prevent ever wrapping around to zero

* Make it use float16 if in amp but use clamp to avoid wrapping error

* Store only half precision output for softmax.

* More memory efficient backprop for DoubleSwish.

* Change to warmup schedule.

* Changes to more accurately estimate OOM conditions

* Reduce cutoff from 100 to 5 for estimating OOM with warmup

* Make 20 the limit for warmup_count

* Cast to float16 in DoubleSwish forward

* Hopefully make penalize_abs_values_gt more memory efficient.

* Add logging about memory used.

* Change scalar_max in optim.py from 2.0 to 5.0

* Regularize how we apply the min and max to the eps of BasicNorm

* Fix clamping of bypass scale; remove a couple unused variables.

* Increase floor on bypass_scale from 0.1 to 0.2.

* Increase bypass_scale from 0.2 to 0.4.

* Increase bypass_scale min from 0.4 to 0.5

* Rename conformer.py to zipformer.py

* Rename Conformer to Zipformer

* Update decode.py by copying from pruned_transducer_stateless5 and changing directory name

* Remove some unused variables.

* Fix clamping of epsilon

* Refactor zipformer for more flexibility so we can change number of encoder layers.

* Have a 3rd encoder, at downsampling factor of 8.

* Refactor how the downsampling is done so that it happens later, but the 1st encoder stack still operates after a subsampling of 2.

* Fix bug RE seq lengths

* Have 4 encoder stacks

* Have 6 different encoder stacks, U-shaped network.

* Reduce dim of linear positional encoding in attention layers.

* Reduce min of bypass_scale from 0.5 to 0.3, and make it not applied in test mode.

* Tuning change to num encoder layers, inspired by relative param importance.

* Make decoder group size equal to 4.

* Add skip connections as in normal U-net

* Avoid falling off the loop for weird inputs

* Apply layer-skip dropout prob

* Have warmup schedule for layer-skipping

* Rework how warmup count is produced; should not affect results.

* Add warmup schedule for zipformer encoder layer, from 1.0 -> 0.2.

* Reduce initial clamp_min for bypass_scale from 1.0 to 0.5.

* Restore the changes from scaled_adam_219 and scaled_adam_exp220, accidentally lost, regarding layer skipping

* Change to schedule of bypass_scale min: make it larger, decrease slower.

* Change schedule after initial loss not promising

* Implement pooling module, add it after initial feedforward.

* Bug fix

* Introduce dropout rate to dynamic submodules of conformer.

* Introduce minimum probs in the SimpleCombiner

* Add bias in weight module

* Remove dynamic weights in SimpleCombine

* Remove the 5th of 6 encoder stacks

* Fix some typos

* small fixes

* small fixes

* Copy files

* Update decode.py

* Add changes from the master

* Add changes from the master

* update results

* Add CI

* Small fixes

* Small fixes

Co-authored-by: Daniel Povey <dpovey@gmail.com>
2022-11-15 16:56:05 +08:00


Introduction

Please refer to https://icefall.readthedocs.io/en/latest/recipes/librispeech/index.html for how to run models in this recipe.

./RESULTS.md contains the latest results.

Transducers

Several folders here have transducer in their names. The following table lists the differences among them.

| Folder | Encoder | Decoder | Comment |
|---|---|---|---|
| transducer | Conformer | LSTM | |
| transducer_stateless | Conformer | Embedding + Conv1d | Using optimized_transducer for computing RNN-T loss |
| transducer_stateless2 | Conformer | Embedding + Conv1d | Using torchaudio for computing RNN-T loss |
| transducer_lstm | LSTM | LSTM | |
| transducer_stateless_multi_datasets | Conformer | Embedding + Conv1d | Using data from GigaSpeech as extra training data |
| pruned_transducer_stateless | Conformer | Embedding + Conv1d | Using k2 pruned RNN-T loss |
| pruned_transducer_stateless2 | Conformer(modified) | Embedding + Conv1d | Using k2 pruned RNN-T loss |
| pruned_transducer_stateless3 | Conformer(modified) | Embedding + Conv1d | Using k2 pruned RNN-T loss + using GigaSpeech as extra training data |
| pruned_transducer_stateless4 | Conformer(modified) | Embedding + Conv1d | Same as pruned_transducer_stateless2 + save averaged models periodically during training |
| pruned_transducer_stateless5 | Conformer(modified) | Embedding + Conv1d | Same as pruned_transducer_stateless4 + more layers + random combiner |
| pruned_transducer_stateless6 | Conformer(modified) | Embedding + Conv1d | Same as pruned_transducer_stateless4 + distillation with HuBERT |
| pruned_transducer_stateless7 | Zipformer | Embedding + Conv1d | First experiment with Zipformer from Dan |
| pruned_transducer_stateless8 | Zipformer | Embedding + Conv1d | Same as pruned_transducer_stateless7, but using extra data from GigaSpeech |
| pruned_stateless_emformer_rnnt2 | Emformer(from torchaudio) | Embedding + Conv1d | Using Emformer from torchaudio for streaming ASR |
| conv_emformer_transducer_stateless | ConvEmformer | Embedding + Conv1d | Using ConvEmformer for streaming ASR + mechanisms in reworked model |
| conv_emformer_transducer_stateless2 | ConvEmformer | Embedding + Conv1d | Using ConvEmformer with simplified memory for streaming ASR + mechanisms in reworked model |
| lstm_transducer_stateless | LSTM | Embedding + Conv1d | Using LSTM with mechanisms in reworked model |
| lstm_transducer_stateless2 | LSTM | Embedding + Conv1d | Using LSTM with mechanisms in reworked model + GigaSpeech (multi-dataset setup) |

The decoder in transducer_stateless is modified from the paper "RNN-Transducer with Stateless Prediction Network". We place an additional Conv1d layer right after the input embedding layer.
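
To make the "Embedding + Conv1d" idea concrete, here is a minimal pure-Python sketch. It is not the recipe's actual PyTorch code. The class name `StatelessDecoderSketch`, its parameters, the random initialization, the left zero-padding, and the choice of a depthwise (per-channel) convolution are all assumptions made for illustration. The point it shows is that each output depends only on the last `context_size` tokens, so no recurrent state is carried.

```python
import random


class StatelessDecoderSketch:
    """Hypothetical illustration: embedding lookup followed by a 1-D convolution.

    The output at position u depends only on tokens u-context_size+1 .. u.
    """

    def __init__(self, vocab_size, embed_dim, context_size, seed=0):
        rng = random.Random(seed)
        self.embed_dim = embed_dim
        self.context_size = context_size
        # Embedding table: one vector of size embed_dim per token id.
        self.embedding = [
            [rng.uniform(-0.1, 0.1) for _ in range(embed_dim)]
            for _ in range(vocab_size)
        ]
        # Assumed depthwise kernel: one weight per channel per context position.
        self.kernel = [
            [rng.uniform(-0.1, 0.1) for _ in range(context_size)]
            for _ in range(embed_dim)
        ]

    def forward(self, tokens):
        """Map a list of token ids to a list of embed_dim-sized output vectors."""
        embedded = [self.embedding[t] for t in tokens]
        # Left-pad with zero vectors so the first outputs see a full window.
        pad = [[0.0] * self.embed_dim for _ in range(self.context_size - 1)]
        padded = pad + embedded
        outputs = []
        for u in range(len(tokens)):
            window = padded[u:u + self.context_size]
            outputs.append([
                sum(self.kernel[c][k] * window[k][c]
                    for k in range(self.context_size))
                for c in range(self.embed_dim)
            ])
        return outputs


decoder = StatelessDecoderSketch(vocab_size=10, embed_dim=4, context_size=2)
long_history = decoder.forward([1, 2, 3, 4])
short_history = decoder.forward([9, 3, 4])
# Both histories end in tokens (3, 4), so the final outputs coincide.
same_last_output = long_history[-1] == short_history[-1]
```

With `context_size=2`, the two token sequences above share only their last two tokens, yet their final output vectors are identical. That limited dependence on history is what "stateless" refers to here.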