
Local Video-to-Text Transcription Tool

  • Writer: Pavel Zosim
  • 2 days ago
  • 3 min read

The Problem

Recording meetings and videos is easy. Extracting useful information from them? Not so much.


I needed a tool to parse meeting recordings and generate clean transcripts for AI summarization - removing filler words, silence, and that one colleague who always goes off-topic about their cat :3


So I built one. (The tool, not the cat.) (⌐■_■)–︻╦╤─ - - - *ba dum tss*


What It Does

Drop a video file → Get clean text transcript → Feed to AI → Get actionable summary.

Tech: OpenAI Whisper + GPU acceleration + Python

Speed: 1-hour video → 6 minutes transcription (RTX 4060)

Privacy: 100% local processing. No cloud uploads.


Key Features

  • 🚀 GPU-accelerated - 10-15x faster than CPU ( ͡° ͜ʖ ͡°)ノ⌐■-■

  • 🎯 High accuracy - 90-95% with medium model

  • 🌍 99+ languages - Auto-detection included (dead languages not included, sorry Ancient Egypt) (҂◡_◡) ᕤ

  • 🔇 Smart filtering - Skips silence automatically (and your colleague's "umms")

  • 📦 Batch processing - Handle multiple files overnight (while you sleep like a normal person ☉ ‿ ⚆)

  • 🔒 Private - Everything runs on your machine (NSA not invited)


Real Use Case

Original workflow:

  • 2-hour meeting recorded

  • 30 minutes reviewing and taking notes

  • Scattered information, missing context

With this tool:

  • 2-hour meeting recorded ʕノ•ᴥ•ʔノ ︵ ┻━┻

  • 6 minutes auto-transcribed

  • 2 minutes AI summary (ChatGPT/Claude)

  • Clean document with decisions and action items

Time saved: ~85%


Setup


# 1. Clone
git clone https://github.com/pavelzosim/video-transcription-tool.git
cd video-transcription-tool

# 2. Install (Windows GPU)
install_gpu.bat

# 3. Run
run_transcription.bat

Drop videos in video/ folder. Transcripts appear in output/.
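For the curious, the folder convention above can be sketched in a few lines of Python. This is only an illustration, not the tool's actual code: the function name `plan_outputs`, the extension list, and the `.txt` output suffix are all assumptions.

```python
from pathlib import Path

# Assumed list of video extensions; the real tool may accept others.
VIDEO_EXTENSIONS = {".mp4", ".mkv", ".mov", ".avi", ".webm"}


def plan_outputs(video_dir, output_dir):
    """Pair each video in video_dir with a transcript path in output_dir."""
    video_dir, output_dir = Path(video_dir), Path(output_dir)
    pairs = []
    for path in sorted(video_dir.iterdir()):
        # Skip sub-folders and anything that is not a known video format.
        if path.is_file() and path.suffix.lower() in VIDEO_EXTENSIONS:
            pairs.append((path, output_dir / (path.stem + ".txt")))
    return pairs
```

Calling `plan_outputs("video", "output")` would return one (source, transcript) pair per video, which is a handy dry run before an overnight batch.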


Performance

Video     Model     Time      Speed (vs. real time)
10 min    medium    3 min     3.3x
1 hour    medium    18 min    3.3x
1 hour    small     12 min    5.0x

CPU processing: 40-60 minutes for 1-hour video
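The Speed figures are simply video length divided by processing time. A tiny sketch that recomputes them from the numbers in this section (the function name `speed_factor` is made up for the example):

```python
def speed_factor(video_minutes, processing_minutes):
    """Return how many times faster than real time a transcription ran."""
    if processing_minutes <= 0:
        raise ValueError("processing_minutes must be positive")
    return video_minutes / processing_minutes


# (video length, model, video minutes, processing minutes) from the table
rows = [
    ("10 min", "medium", 10, 3),
    ("1 hour", "medium", 60, 18),
    ("1 hour", "small", 60, 12),
]

for label, model, video, took in rows:
    print(f"{label} / {model}: {speed_factor(video, took):.1f}x")
```

Plug in your own timings to compare a different GPU or model size on equal terms.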


Use Cases

  • Meetings - Extract action items and decisions

  • Interviews - Transcribe for content creation

  • Lectures - Convert recordings to study notes

  • Podcasts - Generate show notes automatically


AI Integration

The tool generates clean text perfect for AI summarization. Example prompt:


Analyze this meeting transcript:
1. Key decisions made
2. Action items (with owners)
3. Topics discussed
4. Follow-up required

Remove filler words and focus on actionable info.
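
If you script the summarization step, the prompt above can be wrapped in a small helper. A minimal sketch: the name `build_summary_prompt` and the "Transcript:" label are assumptions, and sending the result to ChatGPT or Claude is left to whichever client you use.

```python
PROMPT_TEMPLATE = """Analyze this meeting transcript:
1. Key decisions made
2. Action items (with owners)
3. Topics discussed
4. Follow-up required

Remove filler words and focus on actionable info.

Transcript:
{transcript}
"""


def build_summary_prompt(transcript):
    """Combine the fixed instructions with one transcript's text."""
    text = transcript.strip()
    if not text:
        # An empty transcript usually means the transcription step failed.
        raise ValueError("transcript is empty")
    return PROMPT_TEMPLATE.format(transcript=text)
```

Refusing empty input keeps a failed transcription from quietly turning into a confident summary of nothing.
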

Tech Stack

  • faster-whisper - Optimized Whisper implementation

  • CTranslate2 - Inference engine behind faster-whisper, up to 4x faster than the reference Whisper implementation

  • PyTorch - GPU acceleration

  • FFmpeg - Audio preprocessing
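
To make the FFmpeg step concrete, here is a rough sketch of extracting a mono 16 kHz WAV track, the sample rate Whisper models work with. Whether the tool preprocesses audio exactly this way is an assumption, and the function names are invented for the example; the FFmpeg flags themselves (`-vn`, `-ac`, `-ar`, `-y`) are standard.

```python
import shutil
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_path, wav_path):
    """Build an FFmpeg command that writes a mono 16 kHz audio file."""
    return [
        "ffmpeg",
        "-y",               # overwrite the output file if it exists
        "-i", str(video_path),
        "-vn",              # drop the video stream
        "-ac", "1",         # one audio channel (mono)
        "-ar", "16000",     # resample to 16 kHz
        str(wav_path),
    ]


def extract_audio(video_path, wav_path):
    """Run FFmpeg and return the path of the extracted audio."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    subprocess.run(build_ffmpeg_command(video_path, wav_path), check=True)
    return Path(wav_path)
```

Keeping the command builder separate from the runner makes it easy to print the command and check it before touching any files.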


Why Local?

No cloud uploads = no privacy concerns. Perfect for:

  • Confidential meetings

  • Client calls

  • Internal discussions

  • GDPR compliance


Requirements

  • Python 3.8+ (if you're still on Python 2, we need to talk)

  • NVIDIA GPU (optional but recommended - your CPU will thank you)

  • FFmpeg (for faster processing and to feel like a hacker)


Design Choices

User-friendly first: Interactive menu instead of command-line parameters. Non-technical users can run it without reading docs (because let's be honest, nobody reads docs).


[Image: Video Transcription Tool interface with green text on a black screen, showing settings and options such as model size, language, and transcription.]
Old school terminal interface! :3

Smart defaults: Medium model, VAD enabled, beam size 5. Works great out of the box.
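
In faster-whisper terms, those defaults map onto a call roughly like the one below. Treat it as a sketch, not the tool's source: the `device` and `compute_type` values and both function names are assumptions, while `WhisperModel`, `beam_size`, and `vad_filter` come from the faster-whisper API.

```python
def segments_to_text(segments):
    """Join segment texts into a transcript, skipping empty segments."""
    return "\n".join(seg.text.strip() for seg in segments if seg.text.strip())


def transcribe(path, model_size="medium", device="cuda", compute_type="float16"):
    """Transcribe one file with the defaults described in this post."""
    # Imported lazily so segments_to_text works without the package installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device=device, compute_type=compute_type)
    # beam_size=5 and vad_filter=True mirror "beam size 5" and "VAD enabled".
    segments, info = model.transcribe(path, beam_size=5, vad_filter=True)
    print(f"Detected language: {info.language} ({info.language_probability:.0%})")
    return segments_to_text(segments)
```

On a machine without an NVIDIA GPU, `device="cpu"` with `compute_type="int8"` is the usual fallback, just slower.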

Error handling: Gracefully handles corrupted files, missing audio, format issues. (Tested with videos recorded on a potato.)
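
One common way to get that behavior in a batch run is to catch failures per file, so a single bad recording does not stop the rest. A minimal sketch; `run_batch` and its `transcribe_one` callback are hypothetical names, not the tool's real functions.

```python
def run_batch(paths, transcribe_one):
    """Transcribe every path, collecting failures instead of stopping."""
    done, failed = {}, {}
    for path in paths:
        try:
            done[path] = transcribe_one(path)
        except Exception as exc:
            # Record what went wrong and move on to the next file.
            failed[path] = f"{type(exc).__name__}: {exc}"
    return done, failed
```

After an overnight run, `failed` doubles as the list of files to inspect by hand.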

Results

Personal metrics after 2 months:

  • Processed: 60+ hours of meetings

  • Time saved: ~15 hours

  • Accuracy: 93% average (medium model)


Get It

📖 Docs: Full setup guide included

💬 Issues: Bug reports and feature requests welcome

📄 License: MIT - use it however you want


❤️ Built for productivity. Optimized for meetings. Free and open-source.


( ´◔ ω◔`) ノシ Support: Buy Me a Coffee | Patreon | GitHub
