Spoken AI Transparency Protocol

Amplify the Human Voice, Don’t Replace It

Every story is treated with respect. Every voice is generated with integrity. Our Commitment to Ethical AI is the foundation of every audiobook we produce.

Zero-Training

Your manuscript is processed in a "Paid Tier" environment. It is never used to train public AI models.

Consent-Based

Voices are built on high-fidelity, professional recordings from consenting actors and public domain content.

Identity Protection

Strict "No-Go" policies and biometric fingerprinting block unauthorized voice cloning and deepfakes.

Spoken's Commitment to Ethical AI

Spoken, the AI Audiobook Company™, uses artificial intelligence extensively to help authors generate vivid, compelling audio of their stories. Our use of AI falls into two primary categories:

Large Language Models (LLMs): For the purpose of analyzing and preparing the manuscript for narration.
AI Narration: Utilizing Text-to-Speech (TTS) and Speech-to-Speech (STS).

At Spoken, we believe that AI should amplify the human voice, not replace it. Our mission to create emotive, compelling audiobooks is built on a foundation of Ethical AI, ensuring that every story is treated with respect and every voice is generated with integrity.

Data Privacy: Your Story Stays Yours

We understand that an author’s manuscript is their most valuable intellectual property. Our enterprise-level agreements with Google, ElevenLabs, and Hume.ai include strict "Zero Training" clauses.

No Model Training: Your stories are never used to train or improve the underlying AI models of our partners.
Contractual Firewalls: When your manuscript is analyzed by Gemini, it is processed in a "Paid Tier" environment. Our specific enterprise agreements override standard consumer terms. Partners have no right to retain your audio or text to improve public models.
Exclusive Ownership: Metadata, character analysis, and custom-generated voices remain exclusive to you. They are never added to public libraries.

Ethical Voice Generation (TTS & STS)

We use ElevenLabs and Hume.ai for their industry-leading commitment to "Consent-Based" synthetic speech. Unlike models trained on indiscriminately scraped data, our partners follow rigorous ethical frameworks:

Neural Grapheme-to-Phoneme Mapping: TTS models are built using high-fidelity recordings from consenting professional voice actors and public domain content.
Identity Protection: We use Speech-to-Speech (STS) to map an author’s unique emotional performance onto a digital voice, ensuring the "soul" of the narration is human-driven while maintaining synthetic security.
Traceability: Voices generated through our platform include metadata and digital watermarking to prevent unauthorized deepfakes and ensure the provenance of the audio.

Voices, With Consent and a Future

At Spoken, voices are invited, credited, and compensated. Spoken hosts a growing library of professional voice actors who have each created a consensual, studio-grade clone of their own voice. These are real performers, preserving their artistry in a new medium. Authors can choose to narrate their audiobooks using one of these approved voice clones, gaining access to exceptional performance quality while respecting the human behind the voice.

Every time an audiobook is produced using a voice actor’s clone, that actor is paid through a voice usage system. Their voice continues working for them: earning, reaching new audiences, and extending their creative presence beyond traditional studio sessions. This model allows voice actors to step into the AI era with agency instead of anxiety, and with ownership instead of replacement. It creates a path for passive income that scales ethically while keeping performers involved in this new creative economy.

The Mechanism

An early step in a typical TTS pipeline is turning letters (graphemes) into sounds (phonemes). To know that "St." means "Saint" in "St. Jude" but "Street" in "Main St.", the AI uses a linguistic map together with the surrounding context.
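To make the "St." example concrete, here is a minimal Python sketch of context-based abbreviation expansion followed by a dictionary lookup. It is a toy illustration, not how Spoken's partners implement this step: the function names `expand_st` and `to_phonemes`, the two-rule heuristic, and the tiny hand-written `LEXICON` (ARPAbet-style symbols) are all assumptions made for this example. Production systems typically rely on much larger lexicons and learned models.

```python
import re

# Hypothetical mini-lexicon with ARPAbet-style phonemes, for illustration only.
LEXICON = {
    "saint": ["S", "EY", "N", "T"],
    "street": ["S", "T", "R", "IY", "T"],
    "main": ["M", "EY", "N"],
    "jude": ["JH", "UW", "D"],
}

def expand_st(text: str) -> str:
    """Expand "St." using neighbouring words as context (toy heuristic)."""
    # "St." directly before a capitalized word reads as a title: "St. Jude".
    text = re.sub(r"\bSt\.(?=\s+[A-Z])", "Saint", text)
    # Any remaining "St." that follows a word reads as a road: "Main St.".
    text = re.sub(r"(?<=[A-Za-z]\s)St\.", "Street", text)
    return text

def to_phonemes(text: str) -> list:
    """Map each word to its phonemes; unknown words become ["<unk>"]."""
    words = re.findall(r"[A-Za-z]+", expand_st(text).lower())
    return [LEXICON.get(word, ["<unk>"]) for word in words]

print(expand_st("We met on Main St. near St. Jude Hospital"))
print(to_phonemes("St. Jude"))
```

The heuristic is deliberately simple and would misread a sentence such as "Main St. Then we left", which is one reason real pipelines use richer context than a single neighbouring word.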

Ethical Sourcing: These phonetic maps are based on standard linguistic rules that have existed for centuries. The rules are reinforced by a mix of licensed, consent-based, and public-domain datasets, often including large open datasets such as Common Voice (Mozilla) and LibriVox, which consist of thousands of hours of public domain audio. As a result, models such as Gemini and those from ElevenLabs can learn the rules of English without needing to "copy" any specific modern author or actor.

Once the AI knows what phonemes to say, it needs to know how to shape the sound waves. It uses a Spectrogram—a visual representation of the spectrum of frequencies in a sound as it varies with time. Think of a spectrogram as "sheet music" for a voice. It tells the AI the pitch, timbre, and volume required for a specific character's "signature."
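The "sheet music" idea can be sketched in a few lines of Python. The example below cuts a signal into overlapping frames and measures how much energy each frequency carries in each frame, which is the basic recipe behind a spectrogram. It is a simplified illustration under stated assumptions: the function name `spectrogram`, the frame size, the hop length, and the 500 Hz test tone are chosen for this example, and the naive Fourier transform stands in for the faster FFT (and often mel-scaled) variants that speech systems commonly use.

```python
import cmath
import math

def spectrogram(samples, frame_size=256, hop=128):
    """Return one list of frequency-bin magnitudes per time frame."""
    # A Hann window softens frame edges to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1))
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        magnitudes = []
        # Naive discrete Fourier transform over the non-negative frequencies.
        for k in range(frame_size // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            magnitudes.append(abs(acc))
        frames.append(magnitudes)
    return frames

SAMPLE_RATE = 8000  # samples per second (assumed for this example)
# A pure 500 Hz tone stands in for a recorded voice.
tone = [math.sin(2 * math.pi * 500 * t / SAMPLE_RATE) for t in range(1024)]

spec = spectrogram(tone)
# The strongest bin in the first frame, converted back to a frequency in Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / 256
print(len(spec), "frames;", "strongest frequency:", peak_hz, "Hz")
```

Each row of `spec` is one moment in time and each column is one frequency band, so a real voice would show a changing pattern of strong and weak bands rather than the single steady peak produced by this test tone.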

The Takeaway: Because the AI learns the mathematical patterns of a voice (the frequency and amplitude) rather than "ripping" an MP3, it can generate entirely new sentences that the original voice actor never actually said. This is a transformative process, not a duplicative one.

Technical Integrity & Creative Protection

Preventing Identity Theft

We maintain a strict stance against unauthorized voice cloning. Our integration with ElevenLabs and Hume.ai includes industry-leading safeguards:

Famous Voice Verification: Our partners employ layered safeguards—including identity verification, usage monitoring, and impersonation-prevention systems—to block unauthorized voice cloning, particularly of public figures.
Exclusive Character Synthesis: When we generate a "Custom Voice" for your character based on our analysis, it is a synthetic hybrid. It is a unique acoustic profile that does not exist in the real world, designed to be distinct, non-derivative, and suitable for exclusive use within your project or brand.

TTS vs. STS: The Emotive Bridge

We use two distinct technologies to ensure your story feels human, not robotic.

Text-to-Speech (TTS): Ideal for consistent, clear narration across long-form series by converting written text directly into audio.
Speech-to-Speech (STS): A human "directs" the AI by speaking the line. The AI keeps the human's emotion and timing but swaps the vocal "skin." The actor provides the soul; the AI provides the vocal timbre.

Google Gemini LLM: Responsible Content Analysis

Before a single word is spoken, we use Google Gemini to perform deep character and thematic analysis. Google’s AI principles are core to our workflow:

Responsible AI: Gemini is trained on vast datasets and includes integrated Copyright Recitation Filters, which are designed to prevent the model from outputting copyrighted text from its training data.
Safety Filtering: Gemini models utilize advanced safety layers to detect and mitigate bias, hate speech, or unintended toxic outputs during the analysis phase.
Content Integrity: We use Gemini to extract "Value-Added Metadata" that accurately represents your work without misconstruing its intent.
Human-in-the-Loop: Our AI analysis serves as a "first reader" that prepares metadata for authors to review, ensuring the final audio reflects the author's true vision.

Summary

Ethical AI is not just a policy; it’s our product. By choosing Spoken, you are choosing a partner that respects the craft of writing and the sanctity of the human voice.
