Automatic English Video Subtitle Generation




In the course of an ongoing project, the author came across a large amount of English-language video material worth studying, but the videos carry no subtitles, which may get in the way of students' learning. Once English subtitles are available, tools such as Google Translate can be used to produce Chinese subtitles from them. The author therefore set out to find a quick way to generate subtitles, and this article records the process. The machine used runs 64-bit Windows 10.

Note: your computer must be able to reach Google by some means!

The overall workflow breaks down into:

  • Install Python 2
  • Download and configure ffmpeg
  • Download and modify autosub
  • Run a command to generate the subtitles

Each step is described in turn below:

  1. Install Python 2

Python 2 is required here because autosub, which is invoked later, is written in Python 2. The author tried Python 3, but running autosub under it would likely take substantial changes, so in the end Python 2 was installed as autosub's documentation instructs. Downloading Anaconda2 is recommended for the installation, since it spares you from fetching and configuring dependent packages by hand. autosub's notes also suggest a 32-bit Python; the author has not tested whether 64-bit works. Launching the downloaded Anaconda2 (32-bit) installer shows:

(screenshot)

Click "Next":

(screenshot)

Select "I Agree":

(screenshot)

The author usually selects "All Users"; click "Next":

(screenshot)

Choose an installation folder whose path contains no spaces or non-ASCII characters, e.g. "D:\Anaconda2":

(screenshot)

Leaving the first option unchecked is recommended, as checking it can cause confusion between Python versions (the author also has Python 3 installed, for example). Keep the second option at its default, checked:

(screenshot)

Click "Next", and the installation is complete:

(screenshot)
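With the installer finished, a quick sanity check confirms which interpreter the `python` command resolves to. This is a minimal standalone sketch, not part of autosub:

```python
import sys

# autosub targets Python 2; warn early when launched under the wrong interpreter.
major, minor = sys.version_info[0], sys.version_info[1]
print("Running Python %d.%d" % (major, minor))
if major != 2:
    print("Warning: autosub expects Python 2, not Python %d" % major)
```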

  2. Download and configure ffmpeg

ffmpeg is used here to extract the audio from the video.

(1) From "https://ffmpeg.zeranoe.com/builds/", download a build of ffmpeg matching your operating system; the author downloaded ffmpeg-3.2-win64-static.

(2) Unpack the download, optionally rename the folder to ffmpeg, and copy it as a whole into "D:\Anaconda2" (the destination directory is up to you).

(3) Add the "bin" directory inside the unpacked folder to the system Path environment variable.

First press Win+R to open the Run dialog and enter sysdm.cpl:

(screenshot)

Click OK to open the System Properties panel, then select the Advanced tab:

(screenshot)

Click Environment Variables and, in the system variables pane, locate Path:

(screenshot)

Select the Path variable and click Edit. Click New and add the directory containing the "bin" folder from the archive you unpacked:

(screenshot)

Click OK. This completes downloading and configuring ffmpeg.
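To confirm the Path entry took effect, the lookup that autosub later performs can be reproduced in a few lines. This is a minimal sketch of the same PATH scan; `find_on_path` is a hypothetical helper name:

```python
import os

def find_on_path(program):
    # Scan each PATH entry for the program, trying a ".exe" suffix on Windows.
    exts = [".exe", ""] if os.name == "nt" else [""]
    for path in os.environ.get("PATH", "").split(os.pathsep):
        for ext in exts:
            candidate = os.path.join(path.strip('"'), program + ext)
            if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
                return candidate
    return None

print(find_on_path("ffmpeg") or "ffmpeg not found on PATH")
```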

  3. Download and modify autosub

autosub is the tool that generates the subtitles; for the speech-to-text step it calls the Google Cloud Speech API.

(1) Install autosub with Anaconda2

With Anaconda2 installed, you will find the tool Anaconda Powershell Prompt (Anaconda2). Open it and enter:

<code>pip install autosub</code>

This completes the autosub installation.

(2) Rename autosub

After installation the autosub script sits in "D:\Anaconda2\Scripts"; rename it to "autosub_app.py".

(3) Modify the autosub_app.py code

The key modifications are explained here; the full autosub_app.py code is given at the end of this article.

  • At line 48 of the code, add ", delete=False" so that the temporary file is not deleted. That is, change:
<code>temp = tempfile.NamedTemporaryFile(suffix='.flac')</code>

to:

<code>temp = tempfile.NamedTemporaryFile(suffix='.flac', delete=False)</code>
  • At line 127, append ".exe" so that the ffmpeg.exe file is found. That is, change:
<code>exe_file = os.path.join(path, program)</code>

to:

<code>exe_file = os.path.join(path, program + ".exe")</code>
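The effect of the delete=False change above can be seen in isolation. This is a standalone sketch, separate from autosub's own code:

```python
import os
import tempfile

# With delete=False the file survives close(), so another process (such as
# ffmpeg) can reopen it by name; cleanup becomes the caller's responsibility.
temp = tempfile.NamedTemporaryFile(suffix=".flac", delete=False)
temp.write(b"dummy audio bytes")
temp.close()
print(os.path.exists(temp.name))  # -> True: the file persists after close()
os.remove(temp.name)              # manual cleanup
```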
  • Add proxy support

After the import statements, add a global proxy_dict; at this point it is only a dictionary definition:

<code>proxy_dict = {
    'http': 'http://127.0.0.1:8118',
    'https': 'https://127.0.0.1:8118',
    'use': False
}</code>

Then modify the SpeechRecognizer class: add a proxy member in the __init__ method, and in the __call__ method add logic that decides from the command-line setting whether to use the proxy, issuing the corresponding POST request.

It is also advisable to print a message where requests.exceptions.ConnectionError is caught; otherwise, when the connection to Google's server fails, no error is reported at all and the program silently ends up with a zero-byte .srt subtitle file:

<code>except requests.exceptions.ConnectionError:
    print "ConnectionError\n"
    continue</code>

The modified SpeechRecognizer class looks like this:

<code>class SpeechRecognizer(object):
    def __init__(self, language="en", rate=44100, retries=3, api_key=GOOGLE_SPEECH_API_KEY, proxy=proxy_dict):
        self.language = language
        self.rate = rate
        self.api_key = api_key
        self.retries = retries
        self.proxy = proxy

    def __call__(self, data):
        try:
            for i in range(self.retries):
                url = GOOGLE_SPEECH_API_URL.format(lang=self.language, key=self.api_key)
                headers = {"Content-Type": "audio/x-flac; rate=%d" % self.rate}

                try:
                    if self.proxy['use']:
                        resp = requests.post(url, data=data, headers=headers, proxies=self.proxy)
                    else:
                        resp = requests.post(url, data=data, headers=headers)
                except requests.exceptions.ConnectionError:
                    print "ConnectionError\n"
                    continue

                for line in resp.content.split("\n"):
                    try:
                        line = json.loads(line)
                        line = line['result'][0]['alternative'][0]['transcript']
                        return line[:1].upper() + line[1:]
                    except:
                        # no result
                        continue
        except KeyboardInterrupt:
            return</code>

In the main function, add parsing for a proxy argument so that the proxy can be set from the command line:

<code>parser.add_argument('-P', '--proxy', help="Set proxy server")
args = parser.parse_args()
if args.proxy:
    proxy_dict.update({
        'http': args.proxy,
        'https': args.proxy,
        'use': True
    })
    print("Use proxy " + args.proxy)</code>

This completes the code changes; the program can now be run from the command line to generate subtitles.

  4. Run the command to generate subtitles

(1) Find the proxy configuration (skip this step if you are on a network outside mainland China)

First, locate the computer's proxy settings. On Windows 10, right-click the network icon at the lower right of the desktop, open "Network & Internet" settings, and click Proxy at the bottom of the left-hand list:

(screenshot)

Copy the script address under "Automatic proxy setup", paste it into the browser address bar, open it, scroll to the bottom, and find:

<code>var proxy = "PROXY 127.0.0.1:8118; DIRECT;";
var direct = 'DIRECT;';</code>

This reveals the proxy's IP and port, here 127.0.0.1:8118. (The IP and port may differ depending on the tool in use.)

(2) The subtitle extraction command

Open the Anaconda Powershell Prompt (Anaconda2) tool again and change the working directory to the one containing the target video. For instance, with a video "01_HowComputersWork_sm.mp4" in the root of drive D, first switch to D: and then run:

<code>python D:\Anaconda2\Scripts\autosub_app.py -S en -D en -P http://127.0.0.1:8118 .\01_HowComputersWork_sm.mp4</code>

On a network outside mainland China, -P and its value can be omitted:

<code>python D:\Anaconda2\Scripts\autosub_app.py -S en -D en .\01_HowComputersWork_sm.mp4</code>

The run looks like this:

(screenshot)

At the very end the program may raise a WindowsError; the author has not yet found a fix, but it does not affect the result. The subtitles are already generated: "01_HowComputersWork_sm.srt" appears in the root of drive D, and loading it into the video looks like this:


(screenshot)

Of course, the automatically generated subtitles still need to be reviewed and corrected.
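For that review step, the generated file is plain SRT and easy to inspect programmatically. This is a minimal sketch; `parse_srt` is a hypothetical helper, and the sample cue is made up:

```python
def parse_srt(text):
    # Split on blank lines into cues of: index, timing line, one or more text lines.
    cues = []
    for block in text.strip().split("\n\n"):
        lines = block.splitlines()
        if len(lines) >= 3:
            cues.append({"index": lines[0], "time": lines[1],
                         "text": " ".join(lines[2:])})
    return cues

sample = "1\n00:00:00,500 --> 00:00:02,400\nHow computers work\n"
for cue in parse_srt(sample):
    print(cue["time"], cue["text"])
```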


Full autosub_app.py code:

<code>#!D:\Anaconda2\python.exe
import argparse
import audioop
from googleapiclient.discovery import build
import json
import math
import multiprocessing
import os
import requests
import subprocess
import sys
import tempfile
import wave

from progressbar import ProgressBar, Percentage, Bar, ETA

from autosub.constants import LANGUAGE_CODES, \
    GOOGLE_SPEECH_API_KEY, GOOGLE_SPEECH_API_URL
from autosub.formatters import FORMATTERS

proxy_dict = {
    'http': 'http://127.0.0.1:8118',
    'https': 'https://127.0.0.1:8118',
    'use': False
}


def percentile(arr, percent):
    arr = sorted(arr)
    k = (len(arr) - 1) * percent
    f = math.floor(k)
    c = math.ceil(k)
    if f == c: return arr[int(k)]
    d0 = arr[int(f)] * (c - k)
    d1 = arr[int(c)] * (k - f)
    return d0 + d1


def is_same_language(lang1, lang2):
    return lang1.split("-")[0] == lang2.split("-")[0]


class FLACConverter(object):
    def __init__(self, source_path, include_before=0.25, include_after=0.25):
        self.source_path = source_path
        self.include_before = include_before
        self.include_after = include_after

    def __call__(self, region):
        try:
            start, end = region
            start = max(0, start - self.include_before)
            end += self.include_after
            temp = tempfile.NamedTemporaryFile(suffix='.flac', delete=False)
            command = ["ffmpeg", "-ss", str(start), "-t", str(end - start),
                       "-y", "-i", self.source_path,
                       "-loglevel", "error", temp.name]
            subprocess.check_output(command, stdin=open(os.devnull))
            return temp.read()

        except KeyboardInterrupt:
            return


class SpeechRecognizer(object):
    def __init__(self, language="en", rate=44100, retries=3, api_key=GOOGLE_SPEECH_API_KEY, proxy=proxy_dict):
        self.language = language
        self.rate = rate
        self.api_key = api_key
        self.retries = retries
        self.proxy = proxy

    def __call__(self, data):
        try:
            for i in range(self.retries):
                url = GOOGLE_SPEECH_API_URL.format(lang=self.language, key=self.api_key)
                headers = {"Content-Type": "audio/x-flac; rate=%d" % self.rate}
                try:
                    if self.proxy['use']:
                        resp = requests.post(url, data=data, headers=headers, proxies=self.proxy)
                    else:
                        resp = requests.post(url, data=data, headers=headers)
                except requests.exceptions.ConnectionError:
                    print "ConnectionError\n"
                    continue

                for line in resp.content.split("\n"):
                    try:
                        line = json.loads(line)
                        line = line['result'][0]['alternative'][0]['transcript']
                        return line[:1].upper() + line[1:]
                    except:
                        # no result
                        continue
        except KeyboardInterrupt:
            return


class Translator(object):
    def __init__(self, language, api_key, src, dst):
        self.language = language
        self.api_key = api_key
        self.service = build('translate', 'v2',
                             developerKey=self.api_key)
        self.src = src
        self.dst = dst

    def __call__(self, sentence):
        try:
            if not sentence: return
            result = self.service.translations().list(
                source=self.src,
                target=self.dst,
                q=[sentence]
            ).execute()
            if 'translations' in result and len(result['translations']) and \
                    'translatedText' in result['translations'][0]:
                return result['translations'][0]['translatedText']
            return ""

        except KeyboardInterrupt:
            return


def which(program):
    def is_exe(fpath):
        return os.path.isfile(fpath) and os.access(fpath, os.X_OK)

    fpath, fname = os.path.split(program)
    if fpath:
        if is_exe(program):
            return program
    else:
        for path in os.environ["PATH"].split(os.pathsep):
            path = path.strip('"')
            exe_file = os.path.join(path, program + ".exe")
            if is_exe(exe_file):
                return exe_file
    return None


def extract_audio(filename, channels=1, rate=16000):
    temp = tempfile.NamedTemporaryFile(suffix='.wav', delete=False)
    if not os.path.isfile(filename):
        print "The given file does not exist: {0}".format(filename)
        raise Exception("Invalid filepath: {0}".format(filename))
    if not which("ffmpeg"):
        print "ffmpeg: Executable not found on machine."
        raise Exception("Dependency not found: ffmpeg")
    command = ["ffmpeg", "-y", "-i", filename, "-ac", str(channels), "-ar", str(rate), "-loglevel", "error", temp.name]
    subprocess.check_output(command, stdin=open(os.devnull))

    return temp.name, rate


def find_speech_regions(filename, frame_width=4096, min_region_size=0.5, max_region_size=6):
    reader = wave.open(filename)
    sample_width = reader.getsampwidth()
    rate = reader.getframerate()
    n_channels = reader.getnchannels()

    total_duration = reader.getnframes() / rate
    chunk_duration = float(frame_width) / rate

    n_chunks = int(total_duration / chunk_duration)
    energies = []

    for i in range(n_chunks):
        chunk = reader.readframes(frame_width)
        energies.append(audioop.rms(chunk, sample_width * n_channels))

    threshold = percentile(energies, 0.2)

    elapsed_time = 0

    regions = []
    region_start = None

    for energy in energies:
        is_silence = energy <= threshold
        max_exceeded = region_start and elapsed_time - region_start >= max_region_size

        if (max_exceeded or is_silence) and region_start:
            if elapsed_time - region_start >= min_region_size:
                regions.append((region_start, elapsed_time))
            region_start = None

        elif (not region_start) and (not is_silence):
            region_start = elapsed_time
        elapsed_time += chunk_duration
    return regions


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('source_path', help="Path to the video or audio file to subtitle", nargs='?')
    parser.add_argument('-C', '--concurrency', help="Number of concurrent API requests to make", type=int, default=10)
    parser.add_argument('-o', '--output',
                        help="Output path for subtitles (by default, subtitles are saved in \
the same directory and name as the source path)")
    parser.add_argument('-F', '--format', help="Destination subtitle format", default="srt")
    parser.add_argument('-S', '--src-language', help="Language spoken in source file", default="en")
    parser.add_argument('-D', '--dst-language', help="Desired language for the subtitles", default="en")
    parser.add_argument('-K', '--api-key',
                        help="The Google Translate API key to be used. (Required for subtitle translation)")
    parser.add_argument('--list-formats', help="List all available subtitle formats", action='store_true')
    parser.add_argument('--list-languages', help="List all available source/destination languages", action='store_true')
    parser.add_argument('-P', '--proxy', help="Set proxy server")

    args = parser.parse_args()

    if args.proxy:
        proxy_dict.update({
            'http': args.proxy,
            'https': args.proxy,
            'use': True
        })
        print("Use proxy " + args.proxy)

    if args.list_formats:
        print("List of formats:")
        for subtitle_format in FORMATTERS.keys():
            print("{format}".format(format=subtitle_format))
        return 0

    if args.list_languages:
        print("List of all languages:")
        for code, language in sorted(LANGUAGE_CODES.items()):
            print("{code}\t{language}".format(code=code, language=language))
        return 0

    if args.format not in FORMATTERS.keys():
        print("Subtitle format not supported. Run with --list-formats to see all supported formats.")
        return 1

    if args.src_language not in LANGUAGE_CODES.keys():
        print("Source language not supported. Run with --list-languages to see all supported languages.")
        return 1

    if args.dst_language not in LANGUAGE_CODES.keys():
        print(
            "Destination language not supported. Run with --list-languages to see all supported languages.")
        return 1

    if not args.source_path:
        print("Error: You need to specify a source path.")
        return 1

    audio_filename, audio_rate = extract_audio(args.source_path)

    regions = find_speech_regions(audio_filename)

    pool = multiprocessing.Pool(args.concurrency)
    converter = FLACConverter(source_path=audio_filename)
    recognizer = SpeechRecognizer(language=args.src_language, rate=audio_rate, api_key=GOOGLE_SPEECH_API_KEY, proxy=proxy_dict)

    transcripts = []
    if regions:
        try:
            widgets = ["Converting speech regions to FLAC files: ", Percentage(), ' ', Bar(), ' ', ETA()]
            pbar = ProgressBar(widgets=widgets, maxval=len(regions)).start()
            extracted_regions = []
            for i, extracted_region in enumerate(pool.imap(converter, regions)):
                extracted_regions.append(extracted_region)
                pbar.update(i)
            pbar.finish()

            widgets = ["Performing speech recognition: ", Percentage(), ' ', Bar(), ' ', ETA()]
            pbar = ProgressBar(widgets=widgets, maxval=len(regions)).start()

            for i, transcript in enumerate(pool.imap(recognizer, extracted_regions)):
                transcripts.append(transcript)
                pbar.update(i)
            pbar.finish()

            if not is_same_language(args.src_language, args.dst_language):
                if args.api_key:
                    google_translate_api_key = args.api_key
                    translator = Translator(args.dst_language, google_translate_api_key, dst=args.dst_language,
                                            src=args.src_language)
                    prompt = "Translating from {0} to {1}: ".format(args.src_language, args.dst_language)
                    widgets = [prompt, Percentage(), ' ', Bar(), ' ', ETA()]
                    pbar = ProgressBar(widgets=widgets, maxval=len(regions)).start()
                    translated_transcripts = []
                    for i, transcript in enumerate(pool.imap(translator, transcripts)):
                        translated_transcripts.append(transcript)
                        pbar.update(i)
                    pbar.finish()
                    transcripts = translated_transcripts
                else:
                    print "Error: Subtitle translation requires specified Google Translate API key. \
See --help for further information."
                    return 1

        except KeyboardInterrupt:
            pbar.finish()
            pool.terminate()
            pool.join()
            print "Cancelling transcription"
            return 1

    timed_subtitles = [(r, t) for r, t in zip(regions, transcripts) if t]
    formatter = FORMATTERS.get(args.format)
    formatted_subtitles = formatter(timed_subtitles)

    dest = args.output

    if not dest:
        base, ext = os.path.splitext(args.source_path)
        dest = "{base}.{format}".format(base=base, format=args.format)

    with open(dest, 'wb') as f:
        f.write(formatted_subtitles.encode("utf-8"))

    print "Subtitles file created at {}".format(dest)

    os.remove(audio_filename)

    return 0


if __name__ == '__main__':
    sys.exit(main())</code>

References:

https://github.com/agermanidis/autosub/issues/31

https://github.com/qq2225936589/autosub/blob/master/autosub_app.py

