* support consistency-regularized CTC
* update arguments of cr-ctc
* set default value of cr_loss_masked_scale to 1.0
* minor fix
* refactor code
* update RESULTS.md
* add CTC loss option in zipformer recipe
* add ctc_decode.py
* support CTC model export, add jit_pretrained_ctc.py, pretrained_ctc.py
* update README.md and RESULTS.md
* add CI test
* add the zipformer code, copied from branch from_dan_scaled_adam_exp1119
* support model export with torch.jit.script
* update RESULTS.md
* support exporting streaming model with torch.jit.script
* add results of streaming models, with some minor changes
* update README.md
* add CI test
* update k2 version in requirements-ci.txt
* update pyproject.toml
* init files
* add ctc as auxiliary loss and ctc_decode.py
* tune the scale of the HLG score for 1best, nbest and nbest-oracle
* rename to pruned_transducer_stateless7_ctc
* fix doc
* fix bug, recover the hlg scores
* modify ctc_decode.py, move out the hlg scale
* fix hlg_scale
* add export.py and pretrained.py, and so on
* upload files, update README.md and RESULTS.md
* add CI test
* Bug fix
* Change subsampling factor from 1 to 2
* Implement AttentionCombine as replacement for RandomCombine
* Decrease random_prob from 0.5 to 0.333
* Add print statement
* Apply single_prob mask, so sometimes we just get one layer as output.
* Introduce feature mask per frame
* Include changes from Liyong about padding conformer module.
* Reduce single_prob from 0.5 to 0.25
* Reduce feature_mask_dropout_prob from 0.25 to 0.15.
* Remove dropout from inside ConformerEncoderLayer, for adding to residuals
* Increase feature_mask_dropout_prob from 0.15 to 0.2.
* Swap random_prob and single_prob, to reduce prob of being randomized.
* Decrease feature_mask_dropout_prob back from 0.2 to 0.15, i.e. revert the 43->48 change.
* Randomize order of some modules
* Bug fix
* Stop backprop bug
* Introduce a scale dependent on the masking value
* Implement efficient layer dropout
* Simplify the learned scaling factor on the modules
* Compute valid loss on batch 0.
* Make the scaling factors more global and the randomness of dropout more random
* Bug fix
* Introduce offset in layerdrop_scales
* Remove final combination; implement layer drop that drops the final layers.
* Bug fixes
* Fix bug RE self.training
* Fix bug setting layerdrop mask
* Fix eigs call
* Add debug info
* Remove warmup
* Remove layer dropout and model-level warmup
* Don't always apply the frame mask
* Slight code cleanup/simplification
* Various fixes, finish implementing frame masking
* Remove debug info
* Don't compute validation if printing diagnostics.
* Apply layer bypass during warmup in a new way, including groups of 2 and 4 layers.
* Update checkpoint.py to deal with int params
* Revert initial_scale to previous values.
* Remove the feature where it was bypassing groups of layers.
* Implement layer dropout with probability 0.075
* Fix issue with warmup in test time
* Add warmup schedule where dropout disappears from earlier layers first.
* Have warmup that gradually removes dropout from layers; multiply initialization scales by 0.1.
* Do dropout a different way
* Fix bug in warmup
* Remove debug print
* Make the warmup mask per frame.
* Implement layer dropout (in a relatively efficient way)
* Decrease initial keep_prob to 0.25.
* Make it start warming up from the very start, and increase warmup_batches to 6k
* Change warmup schedule and increase warmup_batches from 4k to 6k
* Make the bypass scale trainable.
* Change the initial keep-prob back from 0.25 to 0.5
* Bug fix
* Limit bypass scale to >= 0.1
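A minimal sketch of what a trainable bypass scale with such a floor can look like (the combination form, names, and defaults here are illustrative assumptions, not the recipe's exact module):

```python
import torch
import torch.nn as nn


class BypassScale(nn.Module):
    """Learnable per-channel interpolation between a layer's input and its
    output, with the scale floored at `min_scale` (0.1, per the commit above)."""

    def __init__(self, channels: int, initial: float = 0.5, min_scale: float = 0.1):
        super().__init__()
        self.scale = nn.Parameter(torch.full((channels,), initial))
        self.min_scale = min_scale

    def forward(self, x_orig: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # Clamp so the layer output never contributes less than min_scale.
        s = self.scale.clamp(min=self.min_scale, max=1.0)
        return x_orig + s * (x - x_orig)
```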
* Revert "Change warmup schedule and increase warmup_batches from 4k to 6k"
This reverts commit 86845bd5d859ceb6f83cd83f3719c3e6641de987.
* Do warmup by dropping out whole layers.
* Decrease frequency of logging variance_proportion
* Make layerdrop different in different processes.
* For speed, drop the same num layers per job.
* Decrease initial_layerdrop_prob from 0.75 to 0.5
* Revert also the changes in scaled_adam_exp85 regarding warmup schedule
* Remove unused code LearnedScale.
* Reintroduce batching to the optimizer
* Various fixes from debugging with nvtx, but removed the NVTX annotations.
* Only apply ActivationBalancer with prob 0.25.
* Fix import: s -> scaling.
* Increase final layerdrop prob from 0.05 to 0.075
* Fix bug where fewer layers were dropped than should be; remove unnecessary print statement.
* Fix bug in choosing layers to drop
* Refactor RelPosMultiheadAttention to have 2nd forward function and introduce more modules in conformer encoder layer
* Reduce final layerdrop_prob from 0.075 to 0.05.
* Fix issue with diagnostics if stats is None
* Remove persistent attention scores.
* Make ActivationBalancer and MaxEig more efficient.
* Cosmetic improvements
* Change scale_factor_scale from 0.5 to 0.8
* Make the ActivationBalancer regress to the data mean, not zero, when enforcing abs constraint.
* Remove unused config value
* Fix bug when channel_dim < 0
* Fix bug when channel_dim < 0
* Simplify how the positional-embedding scores work in attention (thanks to Zengwei for this concept)
* Revert dropout on attention scores to 0.0.
* This should just be a cosmetic change, regularizing how we get the warmup times from the layers.
* Reduce beta from 0.75 to 0.0.
* Reduce stats period from 10 to 4.
* Reworking of ActivationBalancer code to hopefully balance speed and effectiveness.
* Add debug code for attention weights and eigs
* Remove debug statement
* Add different debug info.
* Penalize attention-weight entropies above a limit.
* Remove debug statements
* use larger delta but only penalize if small grad norm
* Bug fixes; change debug freq
* Change cutoff for small_grad_norm
* Implement whitening of values in conformer.
* Also whiten the keys in conformer.
* Fix an issue with scaling of grad.
* Decrease whitening limit from 2.0 to 1.1.
* Fix debug stats.
* Reorganize Whiten() code; configs are not the same as before. Also remove MaxEig for self_attn module
* Bug fix RE float16
* Revert whitening_limit from 1.1 to 2.2.
* Replace MaxEig with Whiten with limit=5.0, and move it to end of ConformerEncoderLayer
* Change LR schedule to start off higher
* Simplify the dropout mask, no non-dropped-out sequences
* Make attention dims configurable, not embed_dim//2, trying 256.
* Reduce attention_dim to 192; cherry-pick scaled_adam_exp130 which is linear_pos interacting with query
* Use half the dim for values, vs. keys and queries.
* Increase initial-lr from 0.04 to 0.05, plus changes for diagnostics
* Cosmetic changes
* Changes to avoid bug in backward hooks, affecting diagnostics.
* Randomly clip attention scores to -5..5.
* Add some random clamping in model.py
* Add reflect=0.1 to invocations of random_clamp()
* Remove in_balancer.
* Revert model.py so there are no constraints on the output.
* Implement randomized backprop for softmax.
* Reduce min_abs from 1e-03 to 1e-04
* Add RandomGrad with min_abs=1.0e-04
* Use full precision to do softmax and store ans.
* Fix bug in backprop of random_clamp()
* Get the randomized backprop for softmax in autocast mode working.
* Remove debug print
* Reduce min_abs from 1.0e-04 to 5.0e-06
* Add hard limit of attention weights to +- 50
* Use normal implementation of softmax.
* Remove use of RandomGrad
* Remove the use of random_clamp in conformer.py.
* Reduce the limit on attention weights from 50 to 25.
* Reduce min_prob of ActivationBalancer from 0.1 to 0.05.
* Penalize too large weights in softmax of AttentionDownsample()
* Also apply limit on logit in SimpleCombiner
* Increase limit on logit for SimpleCombiner to 25.0
* Add more diagnostics to debug gradient scale problems
* Changes to grad scale logging; increase grad scale more frequently if less than one.
* Add logging
* Remove comparison diagnostics, which were not that useful.
* Configuration changes: scores limit 5->10, min_prob 0.05->0.1, cur_grad_scale more aggressive increase
* Reset optimizer state when we change loss function definition.
* Make warmup period decrease scale on simple loss, leaving pruned loss scale constant.
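A hedged sketch of this kind of warmup-dependent loss weighting (the linear ramp, names, and defaults are illustrative assumptions, not the recipe's exact train.py code):

```python
def loss_scales(batch_idx: int,
                warmup_batches: int = 3000,
                final_simple_scale: float = 0.5) -> tuple:
    """Return (simple_loss_scale, pruned_loss_scale) for the current batch:
    the simple-loss weight ramps down during warmup while the pruned-loss
    weight stays constant."""
    if batch_idx >= warmup_batches:
        simple_scale = final_simple_scale
    else:
        frac = batch_idx / warmup_batches
        simple_scale = 1.0 - frac * (1.0 - final_simple_scale)
    return simple_scale, 1.0


# loss = simple_scale * simple_loss + pruned_scale * pruned_loss
```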
* Cosmetic change
* Increase initial-lr from 0.05 to 0.06.
* Increase initial-lr from 0.06 to 0.075 and decrease lr-epochs from 3.5 to 3.
* Fixes to logging statements.
* Introduce warmup schedule in optimizer
* Increase grad_scale of Whiten module
* Add inf check hooks
* Renaming in optim.py; remove step() from scan_pessimistic_batches_for_oom in train.py
* Change base lr to 0.1, also rename from initial lr in train.py
* Adding activation balancers after simple_am_prob and simple_lm_prob
* Reduce max_abs on am_balancer
* Increase max_factor in final lm_balancer and am_balancer
* Use penalize_abs_values_gt, not ActivationBalancer.
* Trying to reduce grad_scale of Whiten() from 0.02 to 0.01.
* Add hooks.py, had neglected to git add it.
* don't do penalize_abs_values_gt on simple_lm_proj and simple_am_proj; reduce --base-lr from 0.1 to 0.075
* Increase probs of activation balancer and make it decay slower.
* Don't print out full non-finite tensor
* Increase default max_factor for ActivationBalancer from 0.02 to 0.04; decrease max_abs in ConvolutionModule.deriv_balancer2 from 100.0 to 20.0
* reduce initial scale in GradScaler
* Increase max_abs in ActivationBalancer of conv module from 20 to 50
* --base-lr 0.075 -> 0.5; --lr-epochs 3 -> 3.5
* Revert 179->180 change, i.e. change max_abs for deriv_balancer2 back from 50.0 to 20.0
* Save some memory in the autograd of DoubleSwish.
* Change the discretization of the sigmoid to be expectation preserving.
* Fix randn to rand
* Try a more exact way to round to uint8 that should prevent ever wrapping around to zero
* Make it use float16 if in amp but use clamp to avoid wrapping error
* Store only half precision output for softmax.
* More memory efficient backprop for DoubleSwish.
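The expectation-preserving discretization mentioned above is essentially stochastic rounding; a minimal sketch, assuming a value in [0, 1] is stored as uint8 for the backward pass (names and scaling are illustrative, not the actual DoubleSwish autograd code):

```python
import torch


def quantize_unit_interval(x: torch.Tensor) -> torch.Tensor:
    """Stochastically round values in [0, 1] to uint8 so the rounding error
    has zero mean (expectation preserving) and never wraps past 255."""
    scaled = x.clamp(0.0, 1.0) * 255.0
    # Adding uniform noise in [0, 1) before flooring gives E[result] == x * 255.
    return (scaled + torch.rand_like(scaled)).floor().clamp(max=255.0).to(torch.uint8)


def dequantize_unit_interval(q: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) / 255.0
```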
* Change to warmup schedule.
* Changes to more accurately estimate OOM conditions
* Reduce cutoff from 100 to 5 for estimating OOM with warmup
* Make 20 the limit for warmup_count
* Cast to float16 in DoubleSwish forward
* Hopefully make penalize_abs_values_gt more memory efficient.
* Add logging about memory used.
* Change scalar_max in optim.py from 2.0 to 5.0
* Regularize how we apply the min and max to the eps of BasicNorm
* Fix clamping of bypass scale; remove a couple unused variables.
* Increase floor on bypass_scale from 0.1 to 0.2.
* Increase bypass_scale from 0.2 to 0.4.
* Increase bypass_scale min from 0.4 to 0.5
* Rename conformer.py to zipformer.py
* Rename Conformer to Zipformer
* Update decode.py by copying from pruned_transducer_stateless5 and changing directory name
* Remove some unused variables.
* Fix clamping of epsilon
* Refactor zipformer for more flexibility so we can change number of encoder layers.
* Have a 3rd encoder, at downsampling factor of 8.
* Refactor how the downsampling is done so that it happens later, but the 1st encoder stack still operates after a subsampling of 2.
* Fix bug RE seq lengths
* Have 4 encoder stacks
* Have 6 different encoder stacks, U-shaped network.
* Reduce dim of linear positional encoding in attention layers.
* Reduce min of bypass_scale from 0.5 to 0.3, and make it not applied in test mode.
* Tuning change to num encoder layers, inspired by relative param importance.
* Make decoder group size equal to 4.
* Add skip connections as in normal U-net
* Avoid falling off the loop for weird inputs
* Apply layer-skip dropout prob
* Have warmup schedule for layer-skipping
* Rework how warmup count is produced; should not affect results.
* Add warmup schedule for zipformer encoder layer, from 1.0 -> 0.2.
* Reduce initial clamp_min for bypass_scale from 1.0 to 0.5.
* Restore the changes from scaled_adam_219 and scaled_adam_exp220, accidentally lost, re layer skipping
* Change to schedule of bypass_scale min: make it larger, decrease slower.
* Change schedule after the initial loss was not promising
* Implement pooling module, add it after initial feedforward.
* Bug fix
* Introduce dropout rate to dynamic submodules of conformer.
* Introduce minimum probs in the SimpleCombiner
* Add bias in weight module
* Remove dynamic weights in SimpleCombiner
* Remove the 5th of 6 encoder stacks
* Fix some typos
* small fixes
* small fixes
* Copy files
* Update decode.py
* Add changes from the master
* Add changes from the master
* update results
* Add CI
* Small fixes
* Small fixes
Co-authored-by: Daniel Povey <dpovey@gmail.com>
* add ScaledLSTM
* add RNNEncoderLayer and RNNEncoder classes in lstm.py
* add RNN and Conv2dSubsampling classes in lstm.py
* hardcode bidirectional=False
* link from pruned_transducer_stateless2
* link scaling.py from pruned_transducer_stateless2
* copy from pruned_transducer_stateless2
* modify decode.py, pretrained.py, test_model.py, train.py
* copy streaming decoding files from pruned_transducer_stateless2
* modify streaming decoding files
* simplified code in ScaledLSTM
* flat weights after scaling
* pruned2 -> pruned4
* link __init__.py
* fix style
* remove add_model_arguments
* modify .flake8
* fix style
* fix scale value in scaling.py
* add random combiner for training deeper model
* add support for using proj_size
* add scaling converter for ScaledLSTM
* support jit trace
* support using averaged model in export.py
* modify test_model.py, test if the model can be successfully exported by jit.trace
* modify pretrained.py
* support streaming decoding
* fix model.py
* Add cut_id to recognition results
* Add cut_id to recognition results
* do not pad in Conv subsampling module; add tail padding during decoding.
* update RESULTS.md
* minor fix
* fix doc
* update README.md
* minor change, filter infinite loss
* remove the condition of raise error
* modify type hint for the return value in model.py
* minor change
* modify RESULTS.md
Co-authored-by: pkufool <wkang.pku@gmail.com>
* init files
* use average value as memory vector for each chunk
* change tail padding length from right_context_length to chunk_length
* correct the files, ln -> cp
* fix bug in conv_emformer_transducer_stateless2/emformer.py
* fix doc in conv_emformer_transducer_stateless/emformer.py
* refactor init states for stream
* modify .flake8
* fix bug about memory mask when memory_size==0
* add @torch.jit.export for init_states function
* update RESULTS.md
* minor change
* update README.md
* modify doc
* replace torch.div() with <<
* fix bug, >> -> <<
* use i & (i - 1) to check whether it is a power of 2
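For reference, a minimal example of the check described above:

```python
def is_power_of_two(i: int) -> bool:
    # A power of two has exactly one bit set, so clearing its lowest set bit
    # via i & (i - 1) yields zero; i > 0 excludes zero itself.
    return i > 0 and (i & (i - 1)) == 0


assert is_power_of_two(8) and not is_power_of_two(12)
```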
* minor fix
* fix error in RESULTS.md
* copy files from existing branch
* add rule in .flake8
* minor style fix
* fix typos
* add tail padding
* refactor, use fixed-length cache for batch decoding
* copy from streaming branch
* copy from streaming branch
* modify emformer states stack and unstack, streaming decoding, to be continued
* refactor Stream class
* rename streaming_feature_extractor.py
* refactor streaming decoding
* test states stack and unstack
* fix bugs, no grad, and num_processed_frames
* add modified_beam_search, fast_beam_search
* support torch.jit.export
* use torch.div
* copy from pruned_transducer_stateless4
* modify export.py
* add author info
* delete other test functions
* minor fix
* modify doc
* fix style
* minor fix doc
* minor fix
* minor fix doc
* update RESULTS.md
* fix typo
* add info
* fix typo
* fix doc
* add test function for conv module, and minor fix.
* add copyright info
* minor change of test_emformer.py
* fix doc of stack and unstack, test case with batch_size=1
* update README.md
* Copy files for editing.
* Add random combine from #229.
* Minor fixes.
* Pass model parameters from the command line.
* Fix warnings.
* Fix warnings.
* Update readme.
* Rename to avoid conflicts.
* Update results.
* Add CI for pruned_transducer_stateless5
* Typo fixes.
* Remove random combiner.
* Update decode.py and train.py to use periodically averaged models.
* Minor fixes.
* Revert to use random combiner.
* Update results.
* Minor fixes.
* Copy files for editing.
* Use librispeech + gigaspeech with modified conformer.
* Support specifying number of workers for on-the-fly feature extraction.
* Feature extraction code for GigaSpeech.
* Combine XL splits lazily during training.
* Fix warnings in decoding.
* Add decoding code for GigaSpeech.
* Fix decoding the gigaspeech dataset.
We have to use the decoder/joiner networks for the GigaSpeech dataset.
* Disable speed perturb for XL subset.
* Compute the Nbest oracle WER for RNN-T decoding.
* Minor fixes.
* Minor fixes.
* Add results.
* Update results.
* Update CI.
* Update results.
* Fix style issues.
* Update results.
* Fix style issues.
* Add modified beam search for pruned rnn-t.
* Fix style issues.
* Update RESULTS.md.
* Fix typos.
* Minor fixes.
* Test the pre-trained model using GitHub actions.
* Let the user install optimized_transducer on her own.
* Fix errors in GitHub CI.
* Begin to use multiple datasets.
* Finish preparing training datasets.
* Minor fixes
* Copy files.
* Finish training code.
* Display losses for gigaspeech and librispeech separately.
* Fix decode.py
* Make the probability to select a batch from GigaSpeech configurable.
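A hedged sketch of how such a configurable selection probability might look (names and structure are illustrative assumptions, not the recipe's actual dataloader code):

```python
import random


def next_batch(libri_iter, giga_iter, giga_prob: float = 0.5):
    """Pick the next training batch from GigaSpeech with probability
    `giga_prob`, otherwise from LibriSpeech; also return the source name
    so the losses can be logged separately."""
    if random.random() < giga_prob:
        return next(giga_iter), "gigaspeech"
    return next(libri_iter), "librispeech"
```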
* Update results.
* Minor fixes.