
I’ve always been a huge believer that applying AI to foreign language learning can solve problems we once thought were unsolvable, slicing through them like a hot knife through butter.
I put that theory into practice with my last project, essayly.ai, a tool that uses AI to seriously level up your English writing skills, and the results have proven the point.
So, with an AI-powered solution for writing out in the wild, my mind naturally started chewing on the next big challenge: How can AI crack the code for improving spoken English?
The Pre-AI Era of English Speaking Practice
For decades, the go-to method for improving spoken English was shadowing and imitation. The playbook was simple: find a piece of English audio or video—often a famous speech from someone like Obama—and repeat it, trying to mimic the delivery.
Here's the thing about this method: It's effective. And it's also completely useless.
Let me explain with a personal story. Back in college, I went all-in on this approach, memorizing and mimicking President Kennedy's famous "Ich bin ein Berliner" speech. I got so good at it that I actually won first place in my university's English speech competition.
First place in a speech contest. Sounds like my spoken English must have been top-tier, right?
Wrong. The reality was, I wasn't good at speaking English. I was just incredibly good at performing one specific text in English.
I could recite, with Kennedy's exact tone, rhythm, and dramatic pauses: "Freedom has many difficulties and democracy is not perfect, but we have never had to put a wall up to keep our people in, to prevent them from leaving us."
But let's be real, how often does that line come up in daily conversation? If you had asked me back then to talk about my hobbies, my hometown, or my favorite food, I would have been completely lost. I had the illusion of fluency, but zero real-world conversational skill.
I’ve said it before: for most of us, the single biggest blocker to improving spoken English isn't pronunciation. It's that when the time comes to speak, we have nothing to say.
Let me prove it with a quick experiment. I’m going to give you a typical speaking topic. Take one minute to prep your answer, and then try to speak for two minutes straight. That mirrors the long-turn task in the IELTS speaking test, but with one crucial twist: you'll answer in your native language, Chinese.
Ready? Here’s the topic:
Tell me about your hobbies.
...So, how did you do? Did you make it to the two-minute mark?
You were speaking your mother tongue, so the technical parts—pronunciation, intonation, finding the "right" words—were non-issues. And yet, I'm willing to bet you struggled. Many people run out of steam in just 15 seconds.
This reveals the fundamental truth: the biggest bottleneck in speaking isn't your delivery, it's a lack of content. The "foreign language" aspect is just a convenient excuse that masks the real problem. We blame our "bad English" when the core issue is that we don't know what to say in the first place.
A friend of mine, Joe Hu, shared a story with me that perfectly illustrates this. Joe had always been convinced his spoken English was terrible. During a trip to New Zealand, an English-speaking gentleman in his tour group asked him about his job. Joe started talking about his profession, which led to a deeper conversation about the internet, AI, and other tech topics.
Suddenly, a lightbulb went off for Joe: he had just been effortlessly conversing with a native speaker for over half an hour. The problem wasn't his English ability. The problem was that when he knew a topic inside and out, the words flowed. On other topics, he was speechless.
So, if you want to truly level up your spoken English, you have to fix the root problem. The first step has nothing to do with English at all. It's about building the ability to generate ideas—even in your native language. The process should look like this:
- The Content Engine: First, know what you want to say.
- The Language Structure: Then, organize those ideas with the right vocabulary, grammar, and phrasing.
- The Delivery Polish: Finally, deliver your message with the proper pronunciation, intonation, speed, and rhythm.
My AI + Speaking Solution
Phase 1: The Material Preparation Stage
Okay, so now that we've deconstructed the problem, it's clear that the first step to truly improving spoken English is to solve the materials problem. Before you can practice, you need the right fuel.
This means starting with two things: 1) a written text that is relevant to you and grammatically perfect, and 2) a corresponding audio version with a standard, clear pronunciation.
Essentially, this "Material Prep" phase is engineered to solve three core challenges in a specific, sequential order:
- The Content Problem: Figuring out what to say.
- The Language Problem: Knowing how to say it correctly.
- The Delivery Problem: Knowing how to make it sound good.
These steps are a strict progression. You can't think about delivery if you don't have the right words, and you can't find the right words if you don't have an idea to express in the first place.

So, when faced with a specific English topic, I leverage AI to move through these three steps systematically.
First, I write out everything I want to say on the topic in my native language, Chinese. This is a critical check. If I can't articulate my thoughts clearly in Chinese, then the problem has nothing to do with English yet. I need to solve the content problem first.
For our example topic, "Tell me about your hobbies," here is the raw content I drafted in Chinese:
当人们问到我的爱好是什么时,我以前经常说阅读和看电影。我的一个朋友指出,阅读和看电影最多算是休闲娱乐,不能算是爱好,因为大家都喜欢读书和看电影。我觉得他说的有道理,我因此开始思考我真正的爱好是什么。在经过一段思考后,我认为摄影可能是我真正的爱好。每当我看到有意思的风景、色彩或者构图时,我就会用手机拍下来。有的时候,我会带着相机随机坐上一辆公交车,随便在任何一站下车,看看有什么值得拍摄的东西。当别人问我为什么喜欢摄影时,我其实没法说明具体的原因。不过我认为这可能就是爱好的本质。
Next, I take my Chinese draft and hand it over to AI for translation. I don't just use a generic translator; I've built a custom AI Agent in Claude.ai and given it a very specific prompt:
You should serve as an English spoken adviser, specializing in translating the user's words into English.
The AI then generates two distinct, high-quality versions. I take the best parts of both—the clarity of the BBC style and the natural phrasing of the spoken version—and merge them into a final script. This process resulted in the following text:
When people asked me what my hobbies were, I used to say reading and watching movies. Then, a friend of mine pointed out that reading and watching films are more like leisure activities since loads of people enjoy them. She had a fair point, which got me thinking about what a hobby really is. After mulling it over, I reckoned that photography might be one of my actual hobbies. Whenever I spot an interesting scene, color, or composition while I'm out and about, I like to snap a picture with my phone. Sometimes, I even take my camera and hop on a random bus, getting off at any stop to see if there's anything worth capturing. When people ask why I like photography, I can't quite put it into words. I suppose that's the essence of a hobby, isn't it?
At this point, I have a perfect script for the topic. The vocabulary, grammar, and phrasing are all correct, and every idea in the text is genuinely mine. It’s my story, not something borrowed.
Finally, I use AI to turn my polished text into a perfect audio file for practice. There are many Text-to-Speech (TTS) solutions out there, but after comparing them, I chose ElevenLabs.
The reason is simple: ElevenLabs has a massive library of high-quality voices, and the generated audio has the least "robot" feel. The intonation is natural, it carries emotional color, and it’s the ideal material to use for shadowing and imitation.
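The post does not say whether the audio was generated in the ElevenLabs web app or through its API. For readers who want to script this step, here is a minimal sketch against the ElevenLabs text-to-speech REST endpoint using only the standard library. The function names, the voice ID, and the choice of `eleven_multilingual_v2` as the model are assumptions for illustration, not the author's setup.

```python
import json
import urllib.request

# Base URL of the ElevenLabs text-to-speech endpoint; the voice ID is appended.
API_BASE = "https://api.elevenlabs.io/v1/text-to-speech"


def build_tts_request(text, voice_id, api_key, model_id="eleven_multilingual_v2"):
    """Build (but do not send) a POST request that asks for MP3 audio of `text`."""
    payload = json.dumps({"text": text, "model_id": model_id}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/{voice_id}",
        data=payload,
        headers={
            "xi-api-key": api_key,  # account API key (placeholder in this sketch)
            "Content-Type": "application/json",
            "Accept": "audio/mpeg",
        },
        method="POST",
    )


def synthesize_to_file(text, voice_id, api_key, output_path):
    """Send the request and write the returned audio bytes to `output_path`."""
    request = build_tts_request(text, voice_id, api_key)
    with urllib.request.urlopen(request) as response, open(output_path, "wb") as out:
        out.write(response.read())
    return output_path
```

Calling `synthesize_to_file(script_text, "YOUR_VOICE_ID", "YOUR_API_KEY", "hobbies.mp3")` would save the narration next to the script; both placeholder values have to come from your own ElevenLabs account.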
Next, after saving the audio file from ElevenLabs, I run it through one final processing step: I generate a second version of the audio file, slowed down to 0.9x speed.
To automate this, I even wrote a quick Python script to handle it for me.
```python
import os
import sys

from pydub import AudioSegment


def change_playback_speed(audio_segment, speed=1.0):
    """
    Adjust audio playback speed.

    Parameters:
        audio_segment (AudioSegment): The audio segment to be adjusted.
        speed (float): The speed ratio. For example, 0.9 means 0.9x playback.

    Returns:
        AudioSegment: The audio segment after speed adjustment.
    """
    # Reinterpret the same raw samples at a scaled frame rate to change playback speed.
    # Note: this shifts the pitch along with the speed.
    sound_with_altered_frame_rate = audio_segment._spawn(
        audio_segment.raw_data,
        overrides={"frame_rate": int(audio_segment.frame_rate * speed)},
    )
    # Convert back to the original frame rate so the file plays normally everywhere
    return sound_with_altered_frame_rate.set_frame_rate(audio_segment.frame_rate)


def main():
    """
    Read an input MP3 path from the command line, adjust its playback speed,
    and export a new MP3 file.
    """
    # Check the number of command line arguments
    if len(sys.argv) < 2:
        print("Usage: python script.py <input_mp3_file>")
        sys.exit(1)

    # Get input MP3 filename
    input_file = sys.argv[1]

    # Check if file exists
    if not os.path.isfile(input_file):
        print(f"Error: File '{input_file}' not found.")
        sys.exit(1)

    # Load the input audio file
    audio = AudioSegment.from_file(input_file)

    # Playback speed ratio: 0.9 means 0.9x speed
    speed = 0.9
    altered_audio = change_playback_speed(audio, speed)

    # Get directory and base name of input file
    dir_name, base_name = os.path.split(input_file)
    file_name, _ = os.path.splitext(base_name)

    # Construct output filename: {original filename}-090.mp3
    output_file = os.path.join(dir_name, f"{file_name}-090.mp3")

    # Export the adjusted audio as a new MP3 file at a 320 kbps bitrate
    altered_audio.export(output_file, format="mp3", bitrate="320k")
    print(f"Output file saved as {output_file}")


if __name__ == "__main__":
    main()
```

Then comes the final step of the material preparation phase: annotating the text.
In this step, I listen to the 0.9x speed audio file on repeat and mark up the script with the following details:
- Pauses: Where the speaker takes a breath or a momentary break.
- Stress: Which words or syllables are emphasized for meaning.
- Intonation: The rise and fall of the voice.
- Linking: Where words naturally flow and connect.
- Tricky Pronunciations: Any words that are easy to get wrong.
This isn't a one-and-done process. I usually make several passes through the audio, each with a single focus. For instance, the first pass might be just for marking pauses. On the second pass, I'll focus only on word stress, and on the third, I'll track the intonation.
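A side note on the 0.9x track used in this phase: slowing audio by changing its frame rate, as the script above does, also lowers the pitch a little. If you would rather keep the original pitch while annotating intonation, one option is ffmpeg's `atempo` filter. The sketch below assumes ffmpeg is installed and on your PATH; the function names and the `-atempo` output suffix are hypothetical choices, not part of the original workflow.

```python
import shutil
import subprocess
from pathlib import Path


def build_atempo_command(input_file: str, speed: float = 0.9) -> list:
    """Build an ffmpeg command that changes tempo without changing pitch."""
    # Stay within the conservative 0.5 to 2.0 range for a single atempo filter.
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    src = Path(input_file)
    # Follow the naming idea of the script above: 0.9 becomes "090".
    suffix = f"{round(speed * 100):03d}"
    dst = src.with_name(f"{src.stem}-{suffix}-atempo.mp3")
    return ["ffmpeg", "-y", "-i", str(src), "-filter:a", f"atempo={speed}", str(dst)]


def slow_down(input_file: str, speed: float = 0.9) -> str:
    """Run ffmpeg and return the path of the slowed-down file."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on PATH")
    command = build_atempo_command(input_file, speed)
    subprocess.run(command, check=True)
    return command[-1]
```

Running `slow_down("hobbies.mp3")` would write `hobbies-090-atempo.mp3` alongside the input, leaving the frame-rate version untouched so the two can be compared by ear.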
Phase 2: The Practice Stage
After all that prep work in Phase 1, here’s the toolkit we have ready to go:
- 📝 An annotated script, marked up with all the crucial delivery cues like pauses, stress, intonation, and tricky pronunciations.
- 🔊 A normal-speed, high-quality audio track.
- 🐢 A 0.9x speed audio track for training.
With these assets in hand, I dive into the shadowing process.
I always start with the 0.9x speed audio. The key technique here is that I don’t practice sentence by sentence or even line by line. I practice pause by pause. A single long sentence might be broken down into several smaller, manageable chunks based on the natural breaks in the audio.
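To make the pause-by-pause idea concrete, here is a small sketch that splits an annotated script into practice chunks. It assumes pauses are marked with a slash in the text, which is a hypothetical notation; the pause positions in the sample sentence are illustrative and not the author's actual annotations.

```python
def split_into_chunks(annotated_script: str, pause_marker: str = "/") -> list:
    """Split an annotated script into practice chunks at each pause mark."""
    parts = (part.strip() for part in annotated_script.split(pause_marker))
    # Drop empty pieces left by doubled or trailing markers.
    return [part for part in parts if part]


# One sentence from the hobbies script, with illustrative pause marks added.
annotated = (
    "Sometimes, / I even take my camera / and hop on a random bus, / "
    "getting off at any stop / to see if there's anything worth capturing."
)

for number, chunk in enumerate(split_into_chunks(annotated), start=1):
    print(f"{number}. {chunk}")
```

Each printed chunk is one unit to play, pause, and repeat before moving on to the next.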
As I go through this process, I’m constantly refining the annotations on my script while deeply internalizing the rhythm and flow of the speech. Once I'm comfortable with the slowed-down version, I switch over to the normal-speed audio and apply the exact same pause-by-pause shadowing method.
After countless reps, I move to the final part of the practice: reciting the text from memory. The goal isn't just to remember the words; it's to reproduce the entire delivery—the pauses, the word stress, the intonation, the linking—from the original audio file.
And this is where the whole system feels like a cheat code. Memorizing this script is infinitely easier than memorizing a generic sample essay someone else wrote. Why? Because these are my words. They reflect my own genuine feelings, experiences, and opinions. I'm not learning to tell someone else's story; I'm mastering the art of telling my own.

This becomes an ongoing habit. Whenever I have a spare moment, I'll do a quick recitation run. Sometimes I'll use an active recall trigger: I’ll randomly pick a passage, glance at the first sentence, and then try to recite the rest of the paragraph from memory.
This recitation process is repeated many times until I reach my "definition of done" for that topic: when the subject comes to mind, the entire polished script flows out naturally, without conscious effort. At that point, I can mark that topic as "mastered" for now.
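The active recall trigger described above is easy to automate. The following is a minimal sketch, assuming finished scripts are kept in a dictionary keyed by topic. The function names are hypothetical, the sentence splitting is deliberately naive, and the "hometown" entry is placeholder text rather than anything from the post.

```python
import random
import re
from typing import Dict, Optional, Tuple


def first_sentence(passage: str) -> str:
    """Return the opening sentence, using a naive split after ., ! or ?."""
    return re.split(r"(?<=[.!?])\s+", passage.strip(), maxsplit=1)[0]


def recall_prompt(
    scripts: Dict[str, str], rng: Optional[random.Random] = None
) -> Tuple[str, str]:
    """Pick a random topic and return (topic, first sentence) as the recall cue."""
    rng = rng or random.Random()
    topic = rng.choice(sorted(scripts))
    return topic, first_sentence(scripts[topic])


scripts = {
    "hobbies": (
        "When people asked me what my hobbies were, I used to say reading and "
        "watching movies. Then, a friend of mine pointed out that reading and "
        "watching films are more like leisure activities since loads of people "
        "enjoy them."
    ),
    # Placeholder entry for illustration only.
    "hometown": "Placeholder opening sentence. Placeholder second sentence.",
}

topic, cue = recall_prompt(scripts)
print(f"Topic: {topic}")
print(f"Cue: {cue}")
print("Now recite the rest from memory.")
```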
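So, when you combine the "Material Prep" phase with the "Practice" phase, the complete, end-to-end workflow for mastering a single speaking topic is: draft the content in Chinese, translate it with AI, generate the audio, create the 0.9x version, annotate the script, shadow at 0.9x and then at normal speed, and recite from memory until the script flows naturally.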

The Results (So, Did It Work?)
After putting this system into practice for the last month, I can confidently say the results are in. I am now able to speak fluently and naturally on a range of topics, including:
- What I've been worried about lately
- My fitness habits
- My personal hobbies
- My favorite travel destinations
- My job and what I do
- How I commute to work
- My hometown
- My reading preferences
Those are eight topics in one month. At that pace (8 × 12 = 96), I'll be able to cover around 100 different topics in standard British English within a year, which is enough to handle almost any conversation life throws at me.
This system has completely reframed the problem for me. My bottleneck is no longer my ability to improve my spoken English. My only real worry now is the content discovery problem: How am I going to find enough new topics to practice with?