Replacing Cloud with Command Prompt: The Journey of Building an Independent Offline TTS on Windows
After discussing systems that maintain asset and server stability, let's go deeper: building a service from scratch when existing cloud services can no longer be relied upon. This is a story of replacing a paid API with command lines, DLL files, and the patience to read error logs.
It started simply. There was a need for an Indonesian Text-to-Speech (TTS) feature in an internal application. Usually, the solution is easy: call the Google Cloud Text-to-Speech API or Azure Cognitive Services. Enter credit card, get API key, done. But this time, no. New policies prohibited voice data from being sent to third parties. Plus, internet connectivity at that location was unstable. Cloud was not an option. We had to build a machine that runs completely offline, on Windows, with a reasonably natural voice.
The search began. "Offline TTS Indonesia Windows." Results? Few. Some suggested NVDA with an add-on, but that's for screen readers, not programmatic integration. Some mentioned eSpeak, but its voice sounded like a robot from the 90s. Then, I found Coqui TTS. An open-source deep learning-based framework. Its GitHub had models for various languages. And there was one model for Indonesian. My eyes lit up. This is it! Just download the model, install Python, run the script. Easy, right?
Turns out, no.
First, there was a version war. Coqui TTS needed Python 3.7 to 3.9; my Windows machine already had Python 3.11. "Ah, just install the older version," I thought. Installed Python 3.9, then tried `pip install TTS`. Error: it needed the Microsoft Visual C++ Build Tools. Alright, download that 2 GB installer. Install. Try again. Another error: `torch` was incompatible with the CUDA version, and I wasn't even using a GPU. After some googling, the fix turned out to be the CPU-only build: `pip install torch --index-url https://download.pytorch.org/whl/cpu`. That finally succeeded.
After hours, `TTS` was finally installed. Time to download the Indonesian model. The command: `tts --model_name "tts_models/id/cv/vits" --text "Halo dunia" --out_path halo.wav`. Enter.
Command Prompt froze. Then, a red error appeared: `RuntimeError: Cannot load tokenizer from 'C:\Users\...\local\cache\torch\tts\...'. No file named tokenizer.json found.`
The model was incomplete. Its configuration or tokenizer file was missing. This is a typical moment in quiet system work: the documentation in README.md looks smooth, but the implementation is full of hidden holes only discovered when you actually run it. The solution? Had to manually download model files from Hugging Face, place them in the correct cache folder, with the correct folder structure. It's like assembling a puzzle without the guide picture.
I opened Hugging Face, searched for the "cv/vits Indonesia" model. Found not one file, but a collection: `config.json`, `model.pth`, `vocab.json`, etc. Downloaded them one by one. Then studied the Coqui TTS cache folder structure by digging into its source code. Had to create the folder `C:\Users\[user]\.local\cache\torch\tts\tts_models\id\cv\vits\`. Placed the files there. Tried the command again.
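For the next machine, I captured that layout as a quick sanity check. A minimal sketch: the path mirrors the cache folder above, and the file list is just the files I downloaded by hand, so both may differ between Coqui TTS releases.

```python
from pathlib import Path

# Cache layout reverse-engineered from the Coqui TTS source; treat this
# path as an assumption to verify locally, not a stable contract.
CACHE_DIR = (Path.home() / ".local" / "cache" / "torch" / "tts"
             / "tts_models" / "id" / "cv" / "vits")
# The files I had to fetch manually from the model's Hugging Face page.
REQUIRED = ["config.json", "model.pth", "vocab.json"]

missing = [name for name in REQUIRED if not (CACHE_DIR / name).is_file()]
if missing:
    raise SystemExit(f"Missing in {CACHE_DIR}: {', '.join(missing)}")
print("Model cache looks complete.")
```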
This time, a progress bar appeared! But it suddenly stopped at 60% with an error: `OSError: [WinError 126] The specified module could not be found`. Which DLL was missing? After tracing the long error log, it turned out the `espeak` library was not being detected. Coqui TTS depends on eSpeak for text normalization (turning the digits "123" into "seratus dua puluh tiga"). But eSpeak for Windows doesn't ship as a regular executable; it's a DLL that must sit on the system PATH or in a folder Python can see.
Downloaded eSpeak NG for Windows. Extracted. No installer. Just `espeak-ng.dll` in the `lib` folder. I copied that DLL to `C:\Windows\System32` (a desperate move). Tried again. Same error. It turned out Python was looking for the library via `ctypes` with the name `espeak.dll`, not `espeak-ng.dll`. I renamed the file. Tried again.
Still an error: more than one DLL was needed; there was `sonic.dll` too. Finally, I added the entire `lib` folder from eSpeak NG to the Windows `PATH` environment variable. Restarted Command Prompt. And… silence. No error, but no audio output either. The process ran, then hung.
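In hindsight, there is a cleaner fix than renaming DLLs or editing the global PATH: on Python 3.8+ you can register the folder for the current process only. A minimal sketch, assuming eSpeak NG was extracted to `C:\tools\espeak-ng` (a path chosen purely for illustration):

```python
import ctypes
import os

# Illustrative extract location -- adjust to wherever eSpeak NG actually lives.
ESPEAK_LIB_DIR = r"C:\tools\espeak-ng\lib"

# Register the folder for this process only (Windows, Python 3.8+);
# ctypes.CDLL/LoadLibraryEx will then search it.
os.add_dll_directory(ESPEAK_LIB_DIR)
# Some lookups (e.g. ctypes.util.find_library) still scan PATH, so prepend it too.
os.environ["PATH"] = ESPEAK_LIB_DIR + os.pathsep + os.environ.get("PATH", "")

# Load eagerly with a full path so dependent DLLs like sonic.dll resolve now,
# and a missing file fails here with a clear message instead of deep inside TTS.
ctypes.CDLL(os.path.join(ESPEAK_LIB_DIR, "espeak-ng.dll"))
```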
This was the most frustrating point. No error message, just a spinning cursor. I opened Task Manager: the Python process was using 25% CPU (one full core) and its memory was slowly rising. It was working, but perhaps stuck in an infinite loop or waiting for something. After five minutes, I killed the process.
A hunt for more detailed logs began. I added the `--debug` flag to the TTS command, and a river of logs flowed. There, hidden among hundreds of lines, was a message: `WARNING: No GPU found. Using CPU. This will be slow.` Fair enough. Then: `INFO: Text normalized to: "Halo dunia"`. Good. Then: `INFO: Running model inference...` And there it stopped.
After reading the documentation and GitHub issues, it turned out the VITS model needed a specific version of the `librosa` library for audio processing. I checked with `pip show librosa`: the installed version was 0.10.0, while the model's requirements pinned 0.9.2. Downgrade: `pip install librosa==0.9.2`. Tried again.
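To keep that pin from silently regressing on another machine, the wrapper can assert it at startup. A tiny sketch using only the standard library (0.9.2 is simply what worked in my testing, not a universal requirement):

```python
from importlib.metadata import version

# 0.9.2 is the version that worked with this model in my testing;
# other Coqui TTS releases may pin differently.
EXPECTED = "0.9.2"
installed = version("librosa")
if installed != EXPECTED:
    raise RuntimeError(
        f"librosa {installed} found, {EXPECTED} expected -- "
        f"run: pip install librosa=={EXPECTED}"
    )
```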
The progress bar moved again! All the way to 100%! Then… `File "halo.wav" saved.` Heart racing, I opened `halo.wav`. The speaker produced a voice: "Halo dunia", in a fairly clear Indonesian accent. Not as smooth as Google, but far better than the eSpeak robot. A small victory that felt like conquering a mountain.
But the task wasn't finished. This only worked in a command prompt, inside my already messy Python environment. How could other applications call it as a service? We needed a wrapper. I wrote a simple Python script that reads text from a command-line argument or a file and generates the audio, plus a batch file `tts.bat` that invokes Python with the correct environment.
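A minimal sketch of that wrapper, using Coqui's documented Python entry point (`TTS.api.TTS` with `tts_to_file`); the `@file` convention and the argument names are my own choices:

```python
# tts_wrapper.py -- a minimal sketch of the CLI wrapper
import argparse

from TTS.api import TTS  # Coqui's documented Python entry point

def main() -> None:
    parser = argparse.ArgumentParser(description="Offline Indonesian TTS")
    parser.add_argument("text", help='Text to speak, or "@notes.txt" to read a file')
    parser.add_argument("--out", default="output.wav", help="Output WAV path")
    args = parser.parse_args()

    text = args.text
    if text.startswith("@"):  # read the text from a file instead
        with open(text[1:], encoding="utf-8") as f:
            text = f.read()

    # Loads from the local cache prepared earlier; no network access needed.
    tts = TTS(model_name="tts_models/id/cv/vits", progress_bar=False)
    tts.tts_to_file(text=text, file_path=args.out)

if __name__ == "__main__":
    main()
```

The companion `tts.bat` then only needs to pick the right interpreter and forward arguments, something like `venv\Scripts\python.exe tts_wrapper.py %*`.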
A new problem: absolute paths. The script needed to know where the model lived, and hardcoding `C:\Users\me` was not an option. I resolved everything relative to the script's own location. Then came the library-path problem: to be portable, every dependency had to be findable. In the end, I decided to wrap everything in a complete Python virtual environment and bundle it with PyInstaller into a single standalone `.exe`.
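The path resolution looked roughly like this; `sys.frozen` and `sys._MEIPASS` are the markers PyInstaller sets in a onefile build, while the `model` and `espeak/lib` folder names are my own layout choices:

```python
import sys
from pathlib import Path

def resource_dir() -> Path:
    """Folder holding bundled resources (model files, eSpeak DLLs)."""
    if getattr(sys, "frozen", False):
        # Inside a PyInstaller onefile build, data is unpacked to _MEIPASS.
        return Path(getattr(sys, "_MEIPASS", Path(sys.executable).parent))
    # Plain script: resources sit next to this file.
    return Path(__file__).resolve().parent

# "model" and "espeak/lib" mirror the layout declared in the spec file below.
MODEL_DIR = resource_dir() / "model"
ESPEAK_DIR = resource_dir() / "espeak" / "lib"
```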
The `pyinstaller` build itself was another adventure. It needed a custom spec file to include the model files and the eSpeak DLLs. After several rounds of trial and error, `tts_server.exe` was finally born: a 450 MB file (it carries the model and the PyTorch libraries). But it can be copied to another Windows machine, dropped into any folder, and run immediately. Just execute `tts_server.exe "Halo dunia"` and an `output.wav` file appears.
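The crucial spec additions looked roughly like this. A fragment only: spec files are plain Python, `datas` copies the model folder into the bundle, `binaries` copies the DLLs, and the `PYZ`/`EXE` sections follow PyInstaller's standard generated template. The paths reflect the illustrative layout above.

```python
# tts_server.spec (fragment) -- only the Analysis section shown
a = Analysis(
    ["tts_wrapper.py"],
    datas=[
        ("model", "model"),  # whole model folder -> <bundle>/model
    ],
    binaries=[
        ("espeak/lib/espeak-ng.dll", "espeak/lib"),
        ("espeak/lib/sonic.dll", "espeak/lib"),
    ],
)
```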
From here, it was one step to integration. Create a simple HTTP server with Flask inside `tts_server.exe`? Possible, but that adds complexity. I chose a simpler solution: the calling application just executes that command line and reads the output file. That's the most primitive and robust "API" possible.
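From the caller's side, the entire "API" is one process invocation plus a file read. A sketch in Python, with the exe path chosen for illustration (the same pattern works from any language that can spawn a process):

```python
import subprocess
import tempfile
from pathlib import Path

# Illustrative install location for the bundled exe.
TTS_EXE = r"C:\tools\tts\tts_server.exe"

def synthesize(text: str) -> bytes:
    """Invoke the offline TTS exe and return the WAV bytes it produced."""
    out = Path(tempfile.mkdtemp()) / "output.wav"
    # check=True turns a non-zero exit code into an exception;
    # the timeout guards against the hangs seen during development.
    subprocess.run([TTS_EXE, text, "--out", str(out)], check=True, timeout=120)
    return out.read_bytes()

if __name__ == "__main__":
    Path("halo.wav").write_bytes(synthesize("Halo dunia"))
```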
When the larger system finally called the offline TTS and got the audio it wanted, there was no fanfare. Just an audio file appearing in the `temp` folder. But for me, every syllable produced was proof of countless hours of quiet work: overcoming missing DLLs, version conflicts, path errors, and a finicky model.
Cloud offers the illusion of simplicity: one API call, get results. But it exacts a price: dependency, recurring costs, and anxiety about connectivity. Building an offline solution like this returns control. We know every component exactly, we can fix it when it breaks, and most importantly: it will keep working as long as the Windows machine is alive, even in a room with zero signal.
Thus, the voice produced is not just audio output, but proof of quiet system work: an independence built from understanding—truly understanding—every compilation failure, every missing library, and every parameter that must be set just right so the machine behind the screen will speak.
FAQ (Casual Q&A)
Q: Why not just use Windows Speech API? It's built-in.
A: Windows Speech API (SAPI) does exist, but its Indonesian voices are very limited and sound like a Windows XP-era robot. For purposes requiring clarity and naturalness, especially public-facing audio, it's not good enough. Plus, control over the model and output is minimal.
Q: 450 MB for one EXE file? Isn't that huge?
A: Correct. The size comes from the machine learning libraries (PyTorch, ~400 MB) and the voice model. This is the trade-off for self-sufficiency. No need to install Python or any other library; the file itself is a complete environment. At enterprise scale, the size is a non-issue compared to reliability and compliance with data regulations.
Q: Can this process be automated or made into an installer?
A: Yes. The entire process from installing dependencies, downloading the model, to building with PyInstaller can be written in a PowerShell or Python script. However, building a robust installer for all Windows types (version, architecture, security policy) is another piece of "quiet work" just as complex.
Q: What about model updates or bug fixes?
A: This is the weakness of a self-contained solution. Updates must be done manually: download new model, rebuild executable, deploy to all machines. But on the other hand, this is also a strength: stability. No sudden changes from cloud providers breaking integration. The version that works will keep working for years.
Q: What's the biggest lesson learned from this project?
A> That "offline" and "self-contained" are not features, but an architecture. It requires design considerations from the start: dependency management, packaging, deployment, and maintenance. The biggest quiet work is in the pre-production phase, ensuring all parts connect correctly, long before the user presses the "play" button.