Whisper

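Contents

Impressions after use
Paper Review
Things I found interesting
Log Mel spectrogram & STFT
Training
cross-attention input
cross-attention output
positional encoding
Data
Decoding
Why timestamp information is available
Model
Encoder
Decoder
Timestamps
Timestamps for a short phrase
Timestamps for a single word
Test code
QKV attention
The offset for positional_embedding in the text tokens
Faster Whisper
VAD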

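Impressions after use

Because my application has to balance latency and accuracy, I only consider the tiny, base, and small models, much as one would pick a size with YOLO. Accuracy largely tracks model size: the larger the model, the higher the accuracy.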

Paper Review

Things I found interesting

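The feature here is not the raw audio array, but the log-mel spectrogram is hardly unfamiliar either. Mel uses fewer features than the STFT and is closer to human perception. By giving more resolution to the lower frequencies, the mel spectrum also helps reduce the effect of background noise.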

The overall architecture is easy to follow: the familiar Transformer. The input is constrained, though: audio sampled at 16,000 Hz, 80 mel channels, a 25-millisecond window, and a 10-millisecond stride.
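To make these numbers concrete, here is a small arithmetic sketch that converts the window and stride into sample counts. The 30-second chunk length is my assumption taken from the paper, not something stated above, and the variable names are only illustrative.

```python
# Convert the front-end settings above into sample counts.
SAMPLE_RATE = 16_000      # Hz
WINDOW_MS = 25            # analysis window length
HOP_MS = 10               # stride between windows
N_MELS = 80               # mel channels
CHUNK_SECONDS = 30        # assumption: audio is processed in 30 s chunks

n_fft = SAMPLE_RATE * WINDOW_MS // 1000        # samples per window
hop_length = SAMPLE_RATE * HOP_MS // 1000      # samples per stride
n_frames = CHUNK_SECONDS * SAMPLE_RATE // hop_length

print(n_fft, hop_length, n_frames)  # 400 160 3000
print((N_MELS, n_frames))           # (80, 3000)
```

Under these assumptions each chunk reaches the model as an 80 x 3000 log-mel matrix.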

Why can we get text aligned to the timeline? Credit goes to the "begin time" and "end time" handling in decoding.py.

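Both faster whisper and the original whisper use mel-based features. Both apply a hann_window for the STFT; the only difference is that faster whisper does it in numpy while the original uses torch. The processing recipe is the same in both.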
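As a rough illustration of the numpy side of this recipe, here is a minimal Hann-windowed STFT written from scratch. It is a sketch, not the code from either library: the function name stft_power and the defaults n_fft=400 and hop_length=160 are my own assumptions, derived from the 25 ms / 10 ms settings at 16 kHz.

```python
import numpy as np

def stft_power(audio, n_fft=400, hop_length=160):
    """Power spectrogram via a Hann-windowed STFT (NumPy only)."""
    # Periodic Hann window, the same values torch.hann_window(n_fft) gives by default.
    window = np.hanning(n_fft + 1)[:-1]
    # Reflect-pad so that frames are centred on multiples of hop_length.
    padded = np.pad(audio, n_fft // 2, mode="reflect")
    n_frames = 1 + (len(padded) - n_fft) // hop_length
    frames = np.stack([
        padded[i * hop_length : i * hop_length + n_fft] for i in range(n_frames)
    ])
    spectrum = np.fft.rfft(frames * window, axis=-1)   # (n_frames, n_fft // 2 + 1)
    return (np.abs(spectrum) ** 2).T                   # (n_fft // 2 + 1, n_frames)

# One second of a 440 Hz tone sampled at 16 kHz.
t = np.arange(16_000) / 16_000
power = stft_power(np.sin(2 * np.pi * 440 * t))
print(power.shape)  # (201, 101)
```

Each frequency bin spans 16,000 / 400 = 40 Hz, so the 440 Hz tone should peak in bin 11.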

Log Mel spectrogram & STFT

    import numpy as np
    import librosa
    import librosa.display
    import matplotlib.pyplot as plt

    # Load the audio file (librosa resamples to 22,050 Hz by default)
    audio_path = 'your_audio_file.wav'
    y, sr = librosa.load(audio_path)

    # Compute the STFT
    n_fft = 2048
    D = librosa.stft(y, n_fft=n_fft)

    # Convert the magnitude spectrum to dB
    D_dB = librosa.amplitude_to_db(np.abs(D), ref=np.max)

    # Build the mel filter bank; its shape (n_mels, 1 + n_fft // 2) matches D
    n_mels = 128
    mel_filter = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)

    # Apply the mel filter bank to the power spectrum
    mel_S = np.dot(mel_filter, np.abs(D) ** 2)

    # Take the log of the mel spectrum
    log_mel_S = librosa.power_to_db(mel_S, ref=np.max)

    # Plot both spectrograms
    plt.figure(figsize=(12, 8))

    plt.subplot(2, 1, 1)
    librosa.display.specshow(D_dB, sr=sr, x_axis='time', y_axis='log')
    plt.title('STFT Power Spectrogram')
    plt.colorbar(format='%+2.0f dB')

    plt.subplot(2, 1, 2)
    librosa.display.specshow(log_mel_S, sr=sr, x_axis='time', y_axis='mel')
    plt.title('Log-Mel Spectrogram')
    plt.colorbar(format='%+2.0f dB')

    plt.tight_layout()
    plt.show()

Training

cross-attention input

SOT: start-of-transcript token
EN: English language token
TRANSCRIBE: task token
timestamp
blah blah blah (the actual text transcribed from the speech)

cross-attention output

EN: English language token
TRANSCRIBE: task token
timestamp
blah blah blah (the actual text transcribed from the speech)
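Read side by side, the output list looks like the input list shifted left by one token, which is the usual next-token (teacher forcing) setup. A minimal sketch of that shift follows. The token strings are hypothetical placeholders (the real model works on integer token ids), and the closing timestamp and end-of-text token are my assumption.

```python
# Hypothetical token strings; the real model uses integer token ids.
sequence = ["<SOT>", "<EN>", "<TRANSCRIBE>", "<0.0>", "hello", "world", "<2.4>", "<EOT>"]

decoder_input = sequence[:-1]   # what the decoder sees
decoder_target = sequence[1:]   # what it is trained to predict next

# Print each input token next to the token it should predict.
for seen, predicted in zip(decoder_input, decoder_target):
    print(f"{seen:>14} -> {predicted}")
```

The first target is the language token, which matches the output list above starting at EN rather than SOT.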

positional encoding

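Two different positional encodings are used here. I am not yet sure whether the mismatch has any effect; I am leaving that as an open question to fill in later.
The input (encoder) side uses Sinusoidal Positional Encoding.
The output (decoder) side uses Learned Positional Encoding.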
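To show the difference between the two, here is a small numpy sketch. The sinusoidal table is computed from a fixed formula and never trained, while the "learned" table is just a randomly initialised array standing in for parameters that training would update. The function names and the sizes (1500 and 448 positions, 384 channels) are illustrative assumptions on my part.

```python
import numpy as np

def sinusoidal_positions(length, channels, max_timescale=10_000):
    """Fixed sin/cos position table of shape (length, channels)."""
    assert channels % 2 == 0
    half = channels // 2
    # Geometric progression of wavelengths from 1 up to max_timescale.
    inv_timescales = np.exp(-np.log(max_timescale) * np.arange(half) / (half - 1))
    angles = np.arange(length)[:, None] * inv_timescales[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def learned_positions(length, channels, seed=0):
    """Stand-in for a learned table: random init that training would update."""
    rng = np.random.default_rng(seed)
    return rng.normal(scale=0.02, size=(length, channels))

fixed = sinusoidal_positions(length=1500, channels=384)
learned = learned_positions(length=448, channels=384)
print(fixed.shape, learned.shape)  # (1500, 384) (448, 384)
```

Both tables are simply added to the token or frame embeddings; the only difference is whether the values are fixed or trainable.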

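Data

Checking data quality in a dataset this large basically requires human involvement (later on, an initially trained Whisper was used to filter the data, followed by manual checks).

Whisper also had its wings clipped, sadly. It originally showed signs of being able to "guess" who was speaking, but like the "imagination" of large NLP models this was just blind guessing. To reduce its impact, that information was removed from training during fine-tuning.

Another interesting point: speech recognition is scored with WER (word error rate), which is a very blunt word-level edit distance. Everyone speaks in a different style, so even data with the same meaning can get a high WER, and the training ends up being for nothing.

Hmm... this data processing depends heavily on manual inspection. If funding is tight, that gets really awkward.

normalise.py and the end of the paper give a pile of normalization tricks.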
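To see why WER is so sensitive to style, and why normalization matters, here is a minimal sketch of WER as a word-level edit distance. The normalize function is a toy of my own (lowercasing and stripping punctuation), not the rules from the repository or the paper, and the example sentences are made up.

```python
import re

def normalize(text):
    """Toy normalizer: lowercase and strip punctuation (not Whisper's own rules)."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def wer(reference, hypothesis):
    """Word error rate = word-level edit distance / number of reference words."""
    ref, hyp = list(reference), list(hypothesis)
    # prev[j] = edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

reference = "Okay, let's meet at ten."
hypothesis = "okay lets meet at 10"

raw = wer(reference.split(), hypothesis.split())
cleaned = wer(normalize(reference), normalize(hypothesis))
print(raw, cleaned)  # 0.6 0.2
```

The two sentences mean the same thing, yet the raw WER is 0.6. The toy normalizer brings it down to 0.2, and the remaining error ("ten" versus "10") is the kind of case a fuller set of normalization rules would have to handle.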