Real-time speech transcription with faster-whisper + silero-vad

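Environment setup

CUDA is required.

Run nvidia-smi.exe in a cmd console to see the GPU driver version and the highest CUDA version that driver supports.

Go to the NVIDIA CUDA website and download the CUDA version that matches your system.

Taking CUDA 11.7 as an example, choose the installer that fits your system and needs (local Windows users typically select, in order: Windows, x86_64, your system version, exe (local)).

After installation, run nvcc -V in a cmd console. If it prints the nvcc version information, including a line such as "Cuda compilation tools, release 11.7", the installation succeeded.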
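The two command-line checks above can also be scripted. The following is a minimal sketch using only the Python standard library; the helper name tool_output is hypothetical, and it only reports whether each tool is on PATH and what it prints, without interpreting the version numbers.

```python
import shutil
import subprocess


def tool_output(command):
    """Return the combined stdout/stderr of `command`, or None if it is missing or cannot be run."""
    if shutil.which(command[0]) is None:
        return None
    try:
        completed = subprocess.run(
            command, capture_output=True, text=True, errors="replace", timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return (completed.stdout + completed.stderr).strip()


if __name__ == "__main__":
    # nvidia-smi reports the driver; nvcc -V reports the installed CUDA toolkit.
    for command in (["nvidia-smi"], ["nvcc", "-V"]):
        label = " ".join(command)
        output = tool_output(command)
        if output is None:
            print(f"{label}: not found on PATH")
        else:
            print(f"{label}:\n{output}")
```

If nvcc is reported as missing right after installing CUDA, opening a new console usually picks up the updated PATH.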

Look up the PyTorch build that matches your CUDA version on the PyTorch website. The command for CUDA 11.7 is:

pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu117

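Check that PyTorch can use the GPU. Start the interpreter by running python in cmd, then enter the two lines below, pressing Enter after each. The second line should print True.

import torch
print(torch.cuda.is_available())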
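The same check can be saved as a small script so it can be rerun after driver or PyTorch upgrades. This is a sketch under the assumption that torch may or may not be installed in the active environment; the function name cuda_status is hypothetical.

```python
import importlib.util


def cuda_status():
    """Return None if torch is not installed, otherwise whether CUDA is usable."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch
    return torch.cuda.is_available()


if __name__ == "__main__":
    status = cuda_status()
    if status is None:
        print("torch is not installed in this environment")
    elif status:
        import torch
        # Report which GPU was found and which CUDA version this torch build targets.
        print("CUDA available:", torch.cuda.get_device_name(0),
              "| torch built for CUDA", torch.version.cuda)
    else:
        print("torch is installed but cannot see a CUDA device")
```

A False result with a working nvidia-smi often means a CPU-only torch wheel was installed instead of the cu117 build.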

Install faster-whisper

pip install faster-whisper

Download the model

silero-vad

Download the model

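Implementation

The idea: PyAudio records in a loop and silero-vad checks each chunk for speech. While someone is speaking the audio is buffered, and once they stop it is saved and transcribed.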
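The buffering logic in this idea can be looked at in isolation, without a microphone or a model. The sketch below is a simplified illustration rather than the exact loop used in this post: segment_utterances is a hypothetical helper, the confidences are supplied by hand instead of coming from silero-vad, and the threshold of 0.5 and the limit of 3 silent chunks are assumed defaults.

```python
def segment_utterances(chunks, confidences, threshold=0.5, max_silent_chunks=3):
    """Yield one bytes object per utterance.

    chunks: iterable of raw audio chunks (bytes)
    confidences: speech probability for each chunk, in the same order
    """
    buffer = None        # audio of the utterance in progress, or None
    silent_chunks = 0    # consecutive non-speech chunks seen so far
    for chunk, confidence in zip(chunks, confidences):
        if confidence > threshold:
            buffer = chunk if buffer is None else buffer + chunk
            silent_chunks = 0
        elif buffer is not None:
            silent_chunks += 1
            if silent_chunks < max_silent_chunks:
                buffer += chunk      # keep short pauses inside the utterance
            else:
                yield buffer         # the pause was long enough: utterance is over
                buffer = None
                silent_chunks = 0
    if buffer is not None:
        yield buffer                 # flush whatever is left when the input ends


if __name__ == "__main__":
    chunks = [b"a", b"b", b"c", b"d", b"e", b"f", b"g"]
    confidences = [0.1, 0.9, 0.8, 0.2, 0.1, 0.0, 0.9]
    print(list(segment_utterances(chunks, confidences)))  # [b'bcde', b'g']
```

Keeping a couple of silent chunks in the buffer means short pauses between words do not split a sentence into separate transcriptions.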

import threading
import wave

import numpy as np
import pyaudio
import torch
from faster_whisper import WhisperModel


def int2float(sound):
    # Convert int16 PCM samples to float32 in the range [-1, 1).
    abs_max = np.abs(sound).max()
    sound = sound.astype('float32')
    if abs_max > 0:
        sound *= 1 / 32768
    return sound.squeeze()


def save_wav(audio):
    # Named save_wav so it does not clash with the save_audio helper in silero-vad's utils.
    with wave.open('output.wav', 'wb') as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(16000)
        wf.writeframes(audio)


def audio2Text(audio):
    # Transcribe a float32 waveform and print the segments joined by commas.
    segments, info = whisperModel.transcribe(audio, beam_size=5, language="zh")
    print(", ".join(segment.text for segment in segments))


if __name__ == '__main__':
    # Load silero-vad from a local checkout; the utils tuple is not needed here.
    model, utils = torch.hub.load(
        repo_or_dir='../../silero-vad',
        model='silero_vad',
        trust_repo=None,
        source='local',
    )
    whisperModel = WhisperModel("../../large-v2", device="cuda", compute_type="float16")

    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    SAMPLE_RATE = 16000
    # 8192 samples is about 0.5 s per chunk. Newer silero-vad releases (v5 and later)
    # accept only 512-sample chunks at 16 kHz, so this size assumes an older release.
    num_samples = 8192

    pa = pyaudio.PyAudio()
    stream = pa.open(format=FORMAT,
                     channels=CHANNELS,
                     rate=SAMPLE_RATE,
                     input=True,
                     frames_per_buffer=num_samples)
    print("Started Recording")

    audio = None   # raw bytes of the utterance in progress
    countSize = 0  # consecutive chunks without speech
    while True:
        audio_chunk = stream.read(num_samples)
        audio_int16 = np.frombuffer(audio_chunk, np.int16)
        audio_float32 = int2float(audio_int16)
        new_confidence = model(torch.from_numpy(audio_float32), SAMPLE_RATE).item()
        if new_confidence > 0.5:
            audio = audio_chunk if audio is None else audio + audio_chunk
            countSize = 0
        else:
            countSize += 1
            if audio is not None and countSize < 3:
                # Short pause: keep it inside the current utterance.
                audio += audio_chunk
            elif audio is not None:
                save_wav(audio)
                # Pass the function and its argument separately so that
                # transcription runs in the new thread, not in this loop.
                t = threading.Thread(target=audio2Text,
                                     args=(int2float(np.frombuffer(audio, np.int16)),),
                                     name='LoopThread')
                t.start()
                audio = None
                countSize = 0
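To sanity-check the saved file and the chunk timing, the WAV parameters used above (mono, 16-bit, 16000 Hz) can be written and read back with the standard library alone. This is a minimal sketch: write_wav and wav_info are hypothetical helper names, and a chunk of silence stands in for microphone input.

```python
import os
import tempfile
import wave


def write_wav(path, pcm_bytes, sample_rate=16000):
    """Write raw 16-bit mono PCM bytes to a WAV file."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)           # 2 bytes per sample = int16
        wf.setframerate(sample_rate)
        wf.writeframes(pcm_bytes)


def wav_info(path):
    """Return (channels, sample_width_bytes, sample_rate, duration_seconds)."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return wf.getnchannels(), wf.getsampwidth(), rate, frames / rate


if __name__ == "__main__":
    # One 8192-sample chunk of silence, the same size the recording loop reads.
    chunk = b"\x00\x00" * 8192
    path = os.path.join(tempfile.mkdtemp(), "check.wav")
    write_wav(path, chunk)
    print(wav_info(path))  # (1, 2, 16000, 0.512)
```

Each 8192-sample chunk lasts 0.512 s at 16000 Hz, so a silence limit of three chunks corresponds to a pause of roughly 1.5 s before an utterance is sent for transcription.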