Metadata-Version: 2.4
Name: tingshuo
Version: 0.1.6
Summary: Generate SRT/LRC subtitles and Markdown transcripts from audio/video files with auto-correction, content summarization, and multimodal video analysis
Author-email: TingShuo Team <wedonotuse@outlook.com>
License-Expression: GPL-3.0-or-later
Project-URL: Homepage, https://github.com/cycleuser/TingShuo
Project-URL: Repository, https://github.com/cycleuser/TingShuo
Project-URL: Issues, https://github.com/cycleuser/TingShuo/issues
Keywords: subtitle,srt,lrc,speech-to-text,whisper,vosk,transcription,transcript,auto-correct,summarization,multimodal
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Environment :: X11 Applications
Classifier: Intended Audience :: End Users/Desktop
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Multimedia :: Sound/Audio :: Speech
Classifier: Topic :: Multimedia :: Video
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: faster-whisper
Requires-Dist: faster-whisper>=1.0.0; extra == "faster-whisper"
Provides-Extra: vosk
Requires-Dist: vosk>=0.3.45; extra == "vosk"
Provides-Extra: whisper
Requires-Dist: openai-whisper>=20231117; extra == "whisper"
Provides-Extra: whisper-cpp
Requires-Dist: pywhispercpp>=1.0.0; extra == "whisper-cpp"
Provides-Extra: nlp
Requires-Dist: nltk>=3.8; extra == "nlp"
Provides-Extra: translation
Requires-Dist: transformers>=4.30.0; extra == "translation"
Requires-Dist: sentencepiece>=0.1.99; extra == "translation"
Provides-Extra: all
Requires-Dist: faster-whisper>=1.0.0; extra == "all"
Requires-Dist: vosk>=0.3.45; extra == "all"
Requires-Dist: openai-whisper>=20231117; extra == "all"
Requires-Dist: pywhispercpp>=1.0.0; extra == "all"
Requires-Dist: nltk>=3.8; extra == "all"
Requires-Dist: transformers>=4.30.0; extra == "all"
Requires-Dist: sentencepiece>=0.1.99; extra == "all"
Dynamic: license-file

# TingShuo 听说

**Generate SRT/LRC subtitles and Markdown transcripts from audio/video files using multiple speech-to-text engines, with auto-correction, LLM polishing, and multimodal content summarization.**

TingShuo recursively scans directories for media files, transcribes them using your choice of STT engine, and outputs subtitle files in SRT, LRC, or Markdown transcript format. Features include LLM-based auto-correction of typos and verbal mistakes, subtitle polishing via LLM or NLP, and content summarization with multimodal video analysis.

## Features

- **4 STT Engines**: faster-whisper, Vosk, OpenAI Whisper, whisper.cpp
- **3 Output Formats**: SRT (SubRip), LRC (lyrics), and MD (Markdown transcript)
- **Markdown Transcript**: Generate clean, structured transcripts from speeches and lectures
- **Auto-Correction**: Fix typos, wrong characters, and verbal mistakes automatically via LLM
- **Content Summarization**: Summarize audio/video content with multimodal video analysis (keyframe extraction + vision LLM)
- **Subtitle Translation**: Translate subtitles to multiple target languages using NLLB or LLM
- **Multi-language UI**: Interface supports English, Chinese, Japanese, Korean, French, German, Spanish, Italian, Portuguese, Russian
- **LLM Polishing**: Merge fragmented subtitles into natural sentences via Ollama or OpenAI-compatible API
- **NLP Polishing**: Sentence boundary detection via nltk (no LLM required)
- **CLI + GUI**: Full command-line interface and tkinter graphical interface
- **Recursive Scanning**: Process entire directory trees of media files
- **HuggingFace Mirror**: Built-in support for HF mirror (useful in mainland China)
- **Flexible Output**: Save subtitles alongside source files or to a custom directory
- **Settings Persistence**: UI language and preferences saved to `~/.config/tingshuo/settings.json`

## Installation

### From PyPI

```bash
# Base install (no STT engine included)
pip install tingshuo

# With a specific engine (quoted so the brackets survive shells like zsh):
pip install "tingshuo[faster-whisper]"   # Recommended
pip install "tingshuo[vosk]"
pip install "tingshuo[whisper]"
pip install "tingshuo[whisper-cpp]"

# With NLP polishing:
pip install "tingshuo[nlp]"

# Everything:
pip install "tingshuo[all]"
```

### From Source

```bash
git clone https://github.com/cycleuser/TingShuo.git
cd TingShuo
pip install -e ".[faster-whisper,nlp]"
```

## Prerequisites

- **Python 3.9+**
- **ffmpeg** must be installed and available on your PATH
  - Linux: `sudo apt install ffmpeg`
  - macOS: `brew install ffmpeg`
  - Windows: Download from [ffmpeg.org](https://ffmpeg.org/download.html) and add to PATH

## Quick Start

### CLI

**Basic transcription (SRT):**
```bash
tingshuo -i ./videos -e faster-whisper -f srt
```

**Generate LRC files to a specific output directory:**
```bash
tingshuo -i ./audio -e vosk -f lrc -o ./subtitles
```

**With LLM polishing (Ollama):**
```bash
tingshuo -i ./media --polish-llm --ollama-model qwen2.5
```

**With LLM polishing (OpenAI-compatible API):**
```bash
tingshuo -i ./media --polish-llm --api-url https://api.example.com --api-key sk-xxx --api-model gpt-4o-mini
```

**With NLP polishing:**
```bash
tingshuo -i ./media --polish-nlp -l en
```

**Generate Markdown transcript from lectures:**
```bash
tingshuo -i ./lectures -f md --polish-llm --ollama-model qwen2.5
```

**Auto-correct typos and verbal mistakes:**
```bash
tingshuo -i ./media --auto-correct --ollama-model qwen2.5
```

**Auto-correct + LLM polishing combined:**
```bash
tingshuo -i ./media --auto-correct --polish-llm --ollama-model qwen2.5
```

**Generate content summary:**
```bash
tingshuo -i ./media --summarize --ollama-model qwen2.5
```

**Summarize with multimodal video analysis (OpenAI-compatible API):**
```bash
tingshuo -i ./videos --summarize --api-url https://api.example.com --api-key sk-xxx --api-model gpt-4o-mini
```

**Specify language and model:**
```bash
tingshuo -i ./videos -e faster-whisper -m large-v3 -l zh
```

**Use HuggingFace mirror (mainland China):**
```bash
tingshuo -i ./videos -e faster-whisper --hf-mirror https://hf-mirror.com
```

**Translate subtitles to multiple languages (NLLB):**
```bash
tingshuo -i ./videos -e faster-whisper --translate --target-lang zh,ja,ko
```

**Translate subtitles using LLM:**
```bash
tingshuo -i ./videos -e faster-whisper --translate --target-lang zh --trans-backend llm --ollama-model qwen2.5
```

**Download a model before transcription:**
```bash
tingshuo --download -e faster-whisper -m large-v3
tingshuo --download -e faster-whisper -m large-v3 --hf-mirror https://hf-mirror.com
```

**Download all models for an engine:**
```bash
tingshuo --download-all -e faster-whisper
```

**List installed Ollama models:**
```bash
tingshuo --list-ollama-models
tingshuo --list-ollama-models --ollama-url http://192.168.1.100:11434
```

### GUI

```bash
tingshuo --gui
```

The GUI provides:
- Directory selection with browse buttons
- Engine and model selection dropdowns
- **Language dropdown** with common languages (auto-detect, zh, en, ja, ko, etc.) or type custom codes
- **Model download buttons** (Download / Download All) with progress feedback
- Format toggle (SRT/LRC/MD)
- **Auto-correction checkbox**: Enable LLM-based auto-correction of transcription errors
- **Content summary checkbox**: Generate summary alongside output, with keyframe interval setting
- Polishing options (None / LLM / NLP) with configuration panels
- **Translation panel**: Enable translation, select target languages, choose backend (NLLB or LLM)
- **Ollama model dropdown** with Refresh button to query installed models from the server
- **Menu bar**: Help > Settings (UI language), Help > About (version info)
- **Multi-language interface**: Settings allow switching between 10 UI languages
- HuggingFace mirror toggle
- Progress bar and real-time log output
- Start/Stop controls

## CLI Reference

```
usage: tingshuo [-h] [--version] [--gui] [-i DIR] [-o DIR] [-f {srt,lrc,md}]
                [--no-recursive] [-e ENGINE] [-m NAME] [-l CODE]
                [--hf-mirror URL] [--download] [--download-all]
                [--list-ollama-models] [--auto-correct]
                [--polish-llm | --polish-nlp]
                [--ollama-url URL] [--ollama-model NAME] [--api-url URL]
                [--api-key KEY] [--api-model NAME] [-v]
                [--translate] [--target-lang CODES]
                [--trans-backend {nllb,llm}] [--nllb-model NAME]
                [--summarize] [--keyframe-interval SECONDS]
```

### Input/Output

| Argument | Description |
|----------|-------------|
| `-i`, `--input DIR` | Input directory containing audio/video files (required) |
| `-o`, `--output DIR` | Output directory for subtitles (default: same as source) |
| `-f`, `--format {srt,lrc,md}` | Output format: srt, lrc, or md (Markdown transcript) (default: srt) |
| `--no-recursive` | Do not scan subdirectories |

### STT Engine

| Argument | Description |
|----------|-------------|
| `-e`, `--engine` | Engine: `faster-whisper`, `vosk`, `whisper`, `whisper-cpp` (default: faster-whisper) |
| `-m`, `--model NAME` | Model name or path (default: engine-specific, usually "base") |
| `-l`, `--language CODE` | Language code: zh, en, ja, etc. Use "auto" for auto-detection (default: auto) |

### HuggingFace Mirror

| Argument | Description |
|----------|-------------|
| `--hf-mirror URL` | HuggingFace mirror URL, e.g. `https://hf-mirror.com` |

### Model Management

| Argument | Description |
|----------|-------------|
| `--download` | Download the model specified by `-e` and `-m`, then exit |
| `--download-all` | Download all known models for the engine specified by `-e`, then exit |
| `--list-ollama-models` | List installed Ollama models from the server (uses `--ollama-url`), then exit |

### Subtitle Polishing

| Argument | Description |
|----------|-------------|
| `--polish-llm` | Polish with LLM (Ollama or OpenAI-compatible API) |
| `--polish-nlp` | Polish with NLP sentence segmentation (nltk) |

### Auto-Correction

| Argument | Description |
|----------|-------------|
| `--auto-correct` | Auto-correct typos, wrong characters, and verbal mistakes using LLM |

### LLM Settings

| Argument | Description |
|----------|-------------|
| `--ollama-url URL` | Ollama API URL (default: http://localhost:11434) |
| `--ollama-model NAME` | Ollama model name (default: qwen2.5) |
| `--api-url URL` | OpenAI-compatible API base URL |
| `--api-key KEY` | API key for OpenAI-compatible service |
| `--api-model NAME` | Model name for API |

### Other

| Argument | Description |
|----------|-------------|
| `--gui` | Launch graphical interface |
| `-v`, `--verbose` | Enable debug logging |
| `--version` | Show version and exit |

### Translation

| Argument | Description |
|----------|-------------|
| `--translate` | Enable subtitle translation to target language(s) |
| `--target-lang CODES` | Comma-separated target language codes, e.g. `zh,en,ja` |
| `--trans-backend {nllb,llm}` | Translation backend: `nllb` (Meta's No Language Left Behind) or `llm` (default: nllb) |
| `--nllb-model NAME` | NLLB model name (default: facebook/nllb-200-distilled-600M) |

### Summarization

| Argument | Description |
|----------|-------------|
| `--summarize` | Generate a content summary (.summary.md) alongside the output |
| `--keyframe-interval SECONDS` | Seconds between keyframe extractions for video summarization (default: 60) |

## Supported Formats

### Input (Audio/Video)

**Audio**: mp3, wav, flac, aac, ogg, wma, m4a, opus

**Video**: mp4, mkv, avi, mov, wmv, flv, webm, ts, m4v, mpg, mpeg

### Output

**SRT** (SubRip Text):
```
1
00:00:01,500 --> 00:00:04,200
This is the first subtitle line.

2
00:00:05,000 --> 00:00:08,300
This is the second subtitle line.
```

**LRC** (Lyrics):
```
[ti:filename]
[re:TingShuo v0.1.6]

[00:01.50]This is the first subtitle line.
[00:05.00]This is the second subtitle line.
```
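The two notations encode the same instant differently: SRT uses `HH:MM:SS,mmm`, LRC uses `[MM:SS.xx]`. An illustrative Python sketch of the conversion from seconds (not TingShuo's internal code):

```python
def to_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_lrc_time(seconds: float) -> str:
    """Format seconds as an LRC timestamp: [MM:SS.xx]."""
    m, s = divmod(seconds, 60)
    return f"[{int(m):02d}:{s:05.2f}]"

print(to_srt_time(1.5))   # 00:00:01,500
print(to_lrc_time(1.5))   # [00:01.50]
```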

**MD** (Markdown Transcript):
```markdown
## Introduction

This is the opening section of the speech, organized into
natural paragraphs by the LLM.

## Main Topic

The speaker then moved on to discuss the main topic,
with key points organized into readable paragraphs.
```

## STT Engines

### faster-whisper (Recommended)

A CTranslate2-based reimplementation of Whisper. Fast, with optional GPU acceleration.

```bash
pip install faster-whisper
```

**Models**: tiny, base, small, medium, large-v2, large-v3

### Vosk

Lightweight offline speech recognition. Lower accuracy but very fast on CPU.

```bash
pip install vosk
```

**Models**: Downloaded automatically by language, or specify a local path with `-m /path/to/model`.

### OpenAI Whisper

The original Whisper model from OpenAI.

```bash
pip install openai-whisper
```

**Models**: tiny, base, small, medium, large

### whisper.cpp

C++ implementation of Whisper via Python bindings. Very fast on CPU.

```bash
pip install pywhispercpp
```

**Models**: tiny, base, small, medium, large

## Subtitle Polishing

### LLM Polishing

Sends subtitle segments to an LLM to merge fragments into complete, natural sentences.

**With Ollama (local):**

1. Install and start [Ollama](https://ollama.com)
2. Pull a model: `ollama pull qwen2.5`
3. Run: `tingshuo -i ./media --polish-llm --ollama-model qwen2.5`

**With Ollama (LAN):**
```bash
tingshuo -i ./media --polish-llm --ollama-url http://192.168.1.100:11434 --ollama-model qwen2.5
```

**With OpenAI-compatible API:**
```bash
tingshuo -i ./media --polish-llm --api-url https://api.openai.com --api-key sk-xxx --api-model gpt-4o-mini
```

### NLP Polishing

Uses nltk sentence tokenization to detect sentence boundaries and merge fragments. No LLM or network access required.

```bash
pip install nltk
tingshuo -i ./media --polish-nlp -l en
```

nltk covers English, German, French, Spanish, Italian, Portuguese, and more. For Chinese, Japanese, and Korean, TingShuo falls back to punctuation-based sentence splitting.
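The punctuation-based fallback for CJK text can be pictured as a regex split on sentence-final punctuation (an illustrative sketch, not TingShuo's actual implementation):

```python
import re

def split_cjk_sentences(text: str) -> list[str]:
    """Split CJK text after sentence-final punctuation, keeping it attached."""
    parts = re.split(r"(?<=[。！？])", text)
    return [p for p in parts if p]

print(split_cjk_sentences("你好。今天天气不错！要出门吗？"))
# ['你好。', '今天天气不错！', '要出门吗？']
```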

## Markdown Transcript

TingShuo can generate clean, structured Markdown transcripts from speeches, lectures, and presentations. Instead of timestamped subtitles, the MD format produces flowing text organized into sections and paragraphs.

```bash
# Generate Markdown transcript (uses LLM to structure paragraphs)
tingshuo -i ./lectures -f md --polish-llm --ollama-model qwen2.5

# With auto-correction for cleaner output
tingshuo -i ./lectures -f md --auto-correct --polish-llm --ollama-model qwen2.5
```

The LLM organizes the raw transcription into logical sections with Markdown headers and paragraphs. If no LLM is configured, a simple paragraph grouping fallback is used.
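The no-LLM fallback presumably amounts to joining consecutive segments into fixed-size paragraphs; a minimal sketch of that idea (hypothetical helper, not TingShuo's actual code):

```python
def group_paragraphs(segments: list[str], per_paragraph: int = 5) -> str:
    """Join every `per_paragraph` consecutive segments into one Markdown paragraph."""
    paragraphs = [
        " ".join(segments[i:i + per_paragraph])
        for i in range(0, len(segments), per_paragraph)
    ]
    return "\n\n".join(paragraphs)

print(group_paragraphs(["First.", "Second.", "Third."], per_paragraph=2))
```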

## Auto-Correction

TingShuo can automatically fix transcription errors before polishing or output. This includes:

- **Typos and wrong characters** (错别字): Common misrecognitions from STT engines
- **Verbal mistakes** (口误): Slips of the tongue in speech
- **Filler words**: Remove "um", "uh", "嗯", "那个", etc. when they add no meaning

```bash
# Auto-correct only
tingshuo -i ./media --auto-correct --ollama-model qwen2.5

# Auto-correct + LLM polishing (correction happens first, then polishing)
tingshuo -i ./media --auto-correct --polish-llm --ollama-model qwen2.5

# Auto-correct with OpenAI-compatible API
tingshuo -i ./media --auto-correct --api-url https://api.example.com --api-key sk-xxx --api-model gpt-4o-mini
```

Auto-correction preserves segment boundaries (timestamps remain unchanged) and works with all output formats (SRT, LRC, MD).

## Content Summarization

TingShuo can generate a content summary (`.summary.md`) alongside the normal output. For video files, it supports multimodal analysis using keyframe extraction and vision-capable LLMs.

### Text-Only Summary (Audio or Video)

```bash
# Summarize using Ollama
tingshuo -i ./media --summarize --ollama-model qwen2.5

# Summarize using OpenAI-compatible API
tingshuo -i ./media --summarize --api-url https://api.example.com --api-key sk-xxx --api-model gpt-4o-mini
```

### Multimodal Video Summary

For video files, TingShuo extracts keyframes using ffmpeg and sends them along with the transcript to a vision-capable LLM for comprehensive analysis:

```bash
# Multimodal summary with keyframe extraction (default: 60s intervals)
tingshuo -i ./videos --summarize --api-url https://api.example.com --api-key sk-xxx --api-model gpt-4o-mini

# Custom keyframe interval (every 30 seconds)
tingshuo -i ./videos --summarize --keyframe-interval 30 --api-url https://api.example.com --api-key sk-xxx --api-model gpt-4o-mini

# With Ollama multimodal models (e.g., llava, llama3.2-vision)
tingshuo -i ./videos --summarize --ollama-model llava
```

The multimodal summary integrates:
- Spoken content from the transcript
- Visual elements: slides, diagrams, charts, demonstrations
- Key visual information that complements the spoken content

If the LLM does not support vision, TingShuo automatically falls back to a text-only summary.
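Roughly, extracting one frame per interval with ffmpeg looks like the following (a sketch of the technique; TingShuo's exact filters and quality settings may differ):

```bash
# One frame every 60 seconds (fps=1/60), written as numbered JPEGs
ffmpeg -i input.mp4 -vf fps=1/60 -q:v 2 keyframe_%04d.jpg
```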

## Subtitle Translation

TingShuo can automatically translate generated subtitles to multiple target languages. Translated subtitles are saved as separate files with language codes (e.g., `video.zh.srt`, `video.ja.srt`).

### NLLB Translation (Recommended)

Uses Meta's NLLB (No Language Left Behind) models for high-quality offline translation supporting 200+ languages.

```bash
# Install dependencies
pip install transformers sentencepiece

# Translate to Chinese and Japanese
tingshuo -i ./videos -e faster-whisper --translate --target-lang zh,ja

# Use a larger NLLB model for better quality
tingshuo -i ./videos --translate --target-lang zh --nllb-model facebook/nllb-200-distilled-1.3B
```

Available NLLB models: `facebook/nllb-200-distilled-600M` (default), `facebook/nllb-200-distilled-1.3B`, `facebook/nllb-200-3.3B`
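NLLB checkpoints identify languages by FLORES-200 codes (e.g. `eng_Latn`, `zho_Hans`) rather than the short codes passed to `--target-lang`, so a mapping step is involved; an illustrative sketch (hypothetical helper, not TingShuo's actual table):

```python
# Short ISO-style codes mapped to the FLORES-200 codes NLLB expects.
FLORES_CODES = {
    "en": "eng_Latn",
    "zh": "zho_Hans",
    "ja": "jpn_Jpan",
    "ko": "kor_Hang",
    "fr": "fra_Latn",
    "de": "deu_Latn",
}

def to_flores(code: str) -> str:
    """Return the FLORES-200 code for a short language code (KeyError if unknown)."""
    return FLORES_CODES[code]

print(to_flores("zh"))  # zho_Hans
```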

### LLM Translation

Uses Ollama or OpenAI-compatible API for translation.

```bash
# Translate using Ollama
tingshuo -i ./videos --translate --target-lang zh --trans-backend llm --ollama-model qwen2.5

# Translate using OpenAI API
tingshuo -i ./videos --translate --target-lang zh --trans-backend llm --api-url https://api.openai.com --api-key sk-xxx --api-model gpt-4o-mini
```

## HuggingFace Mirror

For users in mainland China who have difficulty downloading models from HuggingFace:

```bash
tingshuo -i ./videos -e faster-whisper --hf-mirror https://hf-mirror.com
```

Or set the environment variable directly:
```bash
export HF_ENDPOINT=https://hf-mirror.com
tingshuo -i ./videos -e faster-whisper
```

## License

This project is licensed under the GNU General Public License v3.0. See [LICENSE](LICENSE) for details.
