Environment setup
CUDA is required.
Run nvidia-smi.exe in the cmd console
to check the graphics driver version and the highest CUDA version that driver supports.
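The same check can be scripted. The sketch below is only an illustration: tool_output is a hypothetical helper name, not part of any library, and it simply runs a command found on PATH and returns whatever it prints. It also works for nvcc -V once the CUDA Toolkit is installed.

```python
import shutil
import subprocess


def tool_output(name, *args):
    """Return the text printed by a command-line tool, or None if it is not on PATH.

    Hypothetical helper for illustration only.
    """
    path = shutil.which(name)
    if path is None:
        return None
    done = subprocess.run(
        [path, *args], capture_output=True, text=True, errors="replace", check=False
    )
    return (done.stdout + done.stderr).strip()


if __name__ == "__main__":
    # nvidia-smi ships with the driver; nvcc only exists after the CUDA Toolkit is installed
    for name, args in (("nvidia-smi", ()), ("nvcc", ("-V",))):
        out = tool_output(name, *args)
        print(f"{name}: {'not found on PATH' if out is None else out}")
```

A result of None means the tool is missing from PATH, which usually points to a driver or toolkit that is not installed yet.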
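Download the CUDA version that matches your system from the NVIDIA CUDA website.
Taking CUDA 11.7 as an example, choose the installer options that fit your system and needs (a typical local Windows user selects, in order: Windows, x86_64, the Windows version, exe (local)).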
After the installer finishes, run nvcc -V in the cmd console. If it prints the CUDA compiler version information, the installation succeeded.
Install the PyTorch build that matches CUDA 11.7:
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu117
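Check that PyTorch can use CUDA. Start the Python interpreter by running python in cmd, then type the two lines below, pressing Enter after each; the second should print True.
import torch
print(torch.cuda.is_available())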
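The result of this check can also decide how the Whisper model is loaded later on. The following is a minimal sketch, not the author's code: pick_device is a hypothetical helper, and the CPU fallback with compute_type "int8" is an assumption added here (the original code always uses "cuda" with "float16").

```python
def pick_device():
    """Return a (device, compute_type) pair to pass to WhisperModel.

    Hypothetical helper: falls back to CPU when torch or CUDA is unavailable.
    """
    try:
        import torch
    except ImportError:
        return "cpu", "int8"  # assumption: int8 on CPU; adjust to your needs
    if torch.cuda.is_available():
        return "cuda", "float16"  # the settings used in the original code
    return "cpu", "int8"


device, compute_type = pick_device()
print(device, compute_type)
```

The returned pair can then be passed as WhisperModel(model_path, device=device, compute_type=compute_type).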
Install faster-whisper
pip install faster-whisper
Download the model
silero-vad
Download the model
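Once the model is downloaded, transcribing a single file is a quick way to confirm that it loads. This is a sketch under assumptions: the model directory "../../large-v2" is the path the code in this post uses, transcribe_file and join_segments are hypothetical helper names, and stripping whitespace from each segment is a choice made here rather than something the original code does.

```python
from types import SimpleNamespace


def join_segments(segments, sep=", "):
    """Join the .text of each segment into one string (whitespace stripped here)."""
    return sep.join(segment.text.strip() for segment in segments)


def transcribe_file(path, model_dir="../../large-v2", device="cuda", compute_type="float16"):
    """Transcribe one audio file with a locally downloaded faster-whisper model."""
    # Imported lazily so join_segments can be tried without faster-whisper installed
    from faster_whisper import WhisperModel

    model = WhisperModel(model_dir, device=device, compute_type=compute_type)
    segments, info = model.transcribe(path, beam_size=5, language="zh")
    return join_segments(segments)  # segments is a generator; joining runs the decoding


# Stand-in segments that show the joining step without loading a model
demo = [SimpleNamespace(text=" hello"), SimpleNamespace(text=" world")]
print(join_segments(demo))
```

Calling transcribe_file("output.wav") would then return the joined text for that file.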
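Implementation
The idea: pyaudio records in a loop while silero-vad checks each chunk for speech. Chunks that contain speech are accumulated, and once the speaker has been silent for a few chunks the audio is saved and transcribed.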
import threading
import wave

import numpy as np
import pyaudio
import torch
from faster_whisper import WhisperModel


def int2float(sound):
    # Convert int16 PCM samples to float32 in the range [-1, 1)
    abs_max = np.abs(sound).max()
    sound = sound.astype('float32')
    if abs_max > 0:
        sound *= 1 / 32768
    sound = sound.squeeze()
    return sound


def save_wav(audio):
    # Renamed from save_audio: the silero-vad helper of the same name,
    # unpacked from utils below, would otherwise replace this function
    with wave.open('output.wav', 'wb') as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(16000)
        wf.writeframes(audio)


def audio2Text(audio):
    # Transcribe a float32 numpy array and print the joined segment texts
    result = None
    segments, info = whisperModel.transcribe(audio, beam_size=5, language="zh")
    for segment in segments:
        if result is None:
            result = segment.text
        else:
            result += ", " + segment.text
    print(result)


if __name__ == '__main__':
    model, utils = torch.hub.load(
        repo_or_dir='../../silero-vad',
        model='silero_vad',
        trust_repo=None,
        source='local',
    )
    whisperModel = WhisperModel("../../large-v2", device="cuda", compute_type="float16")
    (get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    SAMPLE_RATE = 16000
    # Chunk size used in this post; some silero-vad releases may expect a different fixed size
    num_samples = 8192

    pa = pyaudio.PyAudio()
    stream = pa.open(format=FORMAT, channels=CHANNELS, rate=SAMPLE_RATE,
                     input=True, frames_per_buffer=num_samples)
    print("Started Recording")

    audio = None   # raw bytes of the utterance being collected
    countSize = 0  # consecutive chunks without speech
    while True:
        audio_chunk = stream.read(num_samples)
        audio_int16 = np.frombuffer(audio_chunk, np.int16)
        audio_float32 = int2float(audio_int16)
        new_confidence = model(torch.from_numpy(audio_float32), 16000).item()
        if new_confidence > 0.5:
            # Speech: start or extend the current utterance
            if audio is None:
                audio = audio_chunk
            else:
                audio = audio + audio_chunk
            countSize = 0
        else:
            countSize = countSize + 1
            if audio is not None and countSize <= 3:
                # Keep a short tail of silence
                audio = audio + audio_chunk
            elif audio is not None:
                # Speaker has stopped: save and transcribe in a background thread
                save_wav(audio)
                # Pass the function and its argument separately; calling it inside
                # Thread(...) would run the transcription in the main loop
                t = threading.Thread(target=audio2Text,
                                     args=(int2float(np.frombuffer(audio, np.int16)),),
                                     name='LoopThread')
                t.start()
                audio = None
                countSize = 0
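The buffering logic inside the recording loop can be isolated so it can be tried without a microphone or a model. The class below is a sketch of that idea, not the author's code: UtteranceBuffer is a hypothetical name, and the 0.5 threshold and 3-chunk silence limit are taken from the values used in this post.

```python
class UtteranceBuffer:
    """Collect chunks while speech is detected; emit the utterance after a silence gap."""

    def __init__(self, threshold=0.5, max_silence=3):
        self.threshold = threshold      # VAD confidence above this counts as speech
        self.max_silence = max_silence  # silent chunks kept before the utterance ends
        self.audio = None
        self.silence = 0

    def feed(self, chunk, confidence):
        """Add one chunk; return the finished utterance bytes, or None if still collecting."""
        if confidence > self.threshold:
            self.audio = chunk if self.audio is None else self.audio + chunk
            self.silence = 0
            return None
        if self.audio is None:
            return None  # silence before any speech: nothing to keep
        self.silence += 1
        if self.silence <= self.max_silence:
            self.audio += chunk  # keep a short tail of silence
            return None
        finished, self.audio, self.silence = self.audio, None, 0
        return finished


# Simulated stream: one silent chunk, two speech chunks, then four silent chunks
buf = UtteranceBuffer()
stream = [(b"a", 0.1), (b"b", 0.9), (b"c", 0.8)] + [(b"s", 0.1)] * 4
results = [buf.feed(chunk, conf) for chunk, conf in stream]
print(results[-1])
```

In the real loop, chunk would be the bytes from stream.read() and confidence the value returned by the silero-vad model; a non-None return value is what gets saved and transcribed.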
Notes
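faster-whisper accepts a file path, a binary IO object, or a numpy array as input. However, a pyaudio recording converted directly to a binary IO object could not be transcribed, so only the numpy route worked, which may cost some precision.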
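One possible explanation, stated here as an assumption rather than a verified cause, is that the bytes returned by pyaudio are raw PCM with no container header, so a decoder cannot tell the sample rate or format. The sketch below wraps the raw bytes in an in-memory WAV file using only the standard library; pcm_to_wav_io is a hypothetical helper, and whether faster-whisper then accepts the result has not been tested here.

```python
import io
import wave


def pcm_to_wav_io(pcm_bytes, sample_rate=16000, channels=1, sample_width=2):
    """Wrap raw int16 PCM bytes in a WAV container held in memory."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)  # 2 bytes per sample = paInt16
        wf.setframerate(sample_rate)
        wf.writeframes(pcm_bytes)
    buf.seek(0)  # rewind so a reader starts at the RIFF header
    return buf


# One second of silence as stand-in recording data
pcm = b"\x00\x00" * 16000
wav_io = pcm_to_wav_io(pcm)
print(wav_io.getvalue()[:4])
```

The returned object behaves like a WAV file opened in binary mode, which is the kind of input a file-based decoder normally expects.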
Timestamps are not implemented. It seems possible to maintain an internal clock and compute them from it.
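A simple way to keep such a clock is to count the samples read from the stream and divide by the sample rate. The following is a sketch of that idea only: SampleClock and absolute_times are hypothetical names, and the (start, end, text) tuples stand in for the start, end and text values of faster-whisper segments, which are relative to the audio passed to transcribe.

```python
class SampleClock:
    """Track the stream position in samples and convert it to seconds."""

    def __init__(self, sample_rate=16000):
        self.sample_rate = sample_rate
        self.samples_seen = 0

    def advance(self, num_samples):
        """Call once per chunk read; returns that chunk's start time in seconds."""
        start = self.samples_seen / self.sample_rate
        self.samples_seen += num_samples
        return start


def absolute_times(utterance_start, segments):
    """Shift (start, end, text) tuples by the time at which the utterance began."""
    return [(utterance_start + s, utterance_start + e, text) for s, e, text in segments]


clock = SampleClock()
starts = [clock.advance(8192) for _ in range(3)]  # three chunks of 8192 samples
shifted = absolute_times(starts[2], [(0.0, 0.5, "example")])
print(starts, shifted)
```

In the recording loop, the value returned by advance() for the first speech chunk of an utterance would be stored and later added to the segment times of that utterance.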