We Removed the 'Tap to Speak' Button From Daily Talk. Here's What Happened.
Most AI conversation apps make you tap a microphone every single turn. Real conversations don't work that way. We rebuilt Daily Talk so the mic listens automatically and the AI responds the moment you stop talking — like a phone call, not a walkie-talkie.
Quick answer: If you've ever practiced English with an AI app and felt the rhythm broken by tapping a mic button every turn — you're not imagining it. Walkie-talkie UX is bad for conversation practice because the gap between "I'm done speaking" and "tap to send" is where your fluency dies. SpeakShark now does Daily Talk hands-free: the mic listens automatically after the AI's greeting, detects when you stop speaking, and the AI responds immediately. The button is still there as a fallback. You won't need it.
I'll be honest — we shipped the tap-to-speak version of Daily Talk for nine months before realizing it was the wrong default. The reasons it was wrong are not what I expected.
This post is about what changed and why I think it matters more than any AI model upgrade we've done.
The walkie-talkie problem
Every conversation app I've tried (Speak, ELSA, Quazel, ChatGPT Voice Mode, Replika, even Duolingo's Max tier) uses some flavor of "tap to start, tap to stop." The mental model is a walkie-talkie. You press a button, talk, release, the other party hears you, they respond.
This works fine for short utterances. It breaks for actual conversation.
Here's what I noticed when I watched twenty learners use SpeakShark with tap-to-speak:
1. They paused mid-sentence to think about pressing the button. A learner would say "I think the weather is — (hand reaches for screen) — going to be nice." That hand reach is dead air, and dead air during your own sentence is one of the worst feelings in language learning. It interrupts the flow that you're paying us to help build.
2. They forgot to press stop. The recorder would keep running for 10–15 seconds after they finished talking. The AI eventually got the full audio, but the lag felt like the app froze. Half of them assumed the app was broken.
3. They never used barge-in. If the AI started saying something boring or off-topic, the user couldn't interrupt without tapping a button to start a new recording. Real conversation has interruptions. Walkie-talkie conversation doesn't.
4. The first turn was always awkward. New users land on the session, hear the AI greeting, and then sit in silence for 5–10 seconds because they don't realize they have to tap. We had a "👇 Tap to speak" hint on the button. Even with that, half of new users got stuck on turn 1.
The cumulative effect: conversations that should have flowed like coffee chat felt like submitting form fields one at a time.
What "real conversation" actually requires
I went back to first principles. What does a real phone call look like? Three properties matter:
- Always-on listening. The other person can speak any time. You don't grant them permission per utterance.
- End-of-turn detection. They know you're done because you stopped talking, not because you announced it.
- Interruption. Either party can interrupt at any moment.
The mic-tap UX violates all three. The fix is a technology called Voice Activity Detection (VAD).
What VAD is, and why we'd disabled it before
VAD is a small ML model that runs in the browser and decides, in real time, whether the microphone is hearing speech or silence. When the user starts talking, VAD knows. When they stop for ~400 milliseconds, VAD knows. The app can use those signals to start/stop recording automatically — no button press needed.
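The start/stop logic described above can be sketched as a small state machine. This is a minimal illustration, not SpeakShark's actual code: it assumes some upstream VAD model already labels each audio frame as speech or silence, and the names `TurnDetector`, `feed`, `events`, and the 20 ms frame length are all hypothetical. Only the 400 ms silence threshold comes from the post. It is written in Python for readability; the production logic would run in the browser.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

FRAME_MS = 20      # assumed frame length; real VAD models vary
SILENCE_MS = 400   # end-of-turn threshold quoted in the post


@dataclass
class TurnDetector:
    """Turns per-frame speech/silence flags into start/stop events."""
    silence_ms: int = SILENCE_MS
    frame_ms: int = FRAME_MS
    recording: bool = False
    quiet_ms: int = 0

    def feed(self, is_speech: bool) -> Optional[str]:
        """Consume one frame; return "start", "stop", or None."""
        if is_speech:
            self.quiet_ms = 0              # any speech resets the silence timer
            if not self.recording:
                self.recording = True
                return "start"             # speech onset: begin recording
            return None
        if self.recording:
            self.quiet_ms += self.frame_ms
            if self.quiet_ms >= self.silence_ms:
                self.recording = False
                self.quiet_ms = 0
                return "stop"              # enough silence: end of turn
        return None


def events(flags: Sequence[bool]) -> List[Tuple[int, str]]:
    """Run a fresh detector over the flags; return (frame_start_ms, event) pairs."""
    detector = TurnDetector()
    out = []
    for i, flag in enumerate(flags):
        event = detector.feed(flag)
        if event is not None:
            out.append((i * detector.frame_ms, event))
    return out
```

Note that a short pause (under 400 ms) inside a sentence does not end the turn, because the next speech frame resets the silence counter. That is the property that lets a learner hesitate mid-sentence without being cut off.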
We've had VAD code in SpeakShark since launch. We had it ON in the very early IELTS practice version, and learners loved it. Then we disabled it on March 25 because of a real problem: TV/ambient noise was triggering false recordings.
The disable commit message said: "fix: disable VAD auto-loop — user must tap mic each turn (prevents TV/ambient)".
We swung too far. Yes, ambient false-triggers are annoying. But they're annoying for ~10% of users in ~10% of sessions. The other 90% lost the hands-free flow entirely as collateral damage. Bad trade.
We re-enabled VAD this week — but only for Daily Talk.
Why Daily Talk specifically (not Challenges or Role Play)
Different modes need different defaults. Here's how we think about it:
| Mode | VAD auto-loop | Reason |
|---|---|---|
| Daily Talk | ✓ ON | Open conversation, natural flow matters most |
| Challenges | ✗ OFF | Timed prompts — user needs control over when the clock starts |
| Role Play | ✗ OFF | Scenario-driven — turn structure is part of the practice |
| Pronunciation | ✗ OFF | Per-word drills — VAD over-triggers between words |
Daily Talk is the one where conversation IS the point. The other modes have other structures that benefit from explicit turn-taking. So Daily Talk gets auto-VAD; the rest stay tap-to-speak.
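As a rough illustration, the table above maps naturally onto a per-mode default. The `Mode` enum, the `VAD_AUTO_LOOP` mapping, and the `vad_auto_loop` helper are hypothetical names invented for this sketch; only the ON/OFF values come from the table.

```python
from enum import Enum


class Mode(Enum):
    DAILY_TALK = "daily_talk"
    CHALLENGES = "challenges"
    ROLE_PLAY = "role_play"
    PRONUNCIATION = "pronunciation"


# ON/OFF values mirror the table in the post; the structure is an assumption.
VAD_AUTO_LOOP = {
    Mode.DAILY_TALK: True,       # open conversation, natural flow matters most
    Mode.CHALLENGES: False,      # timed prompts, user controls the clock
    Mode.ROLE_PLAY: False,       # turn structure is part of the practice
    Mode.PRONUNCIATION: False,   # VAD over-triggers between words
}


def vad_auto_loop(mode: Mode) -> bool:
    """Return whether the mic should listen automatically in this mode."""
    # Assumption: a mode missing from the mapping falls back to tap-to-speak.
    return VAD_AUTO_LOOP.get(mode, False)
```

Defaulting unknown modes to tap-to-speak is a design choice made for this sketch: a mode added later would have to opt in to auto-listening explicitly.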
The new flow, end to end
Here's what a Daily Talk session looks like as of this week:
1. User picks a teacher (Sarah / James / Emily / Liam). No topic picker.
2. Tap "Start". Page loads.
3. AI greeting plays automatically: "Hi there — it's lovely to meet you.
I'm Sarah. May I ask what your name is?"
4. AI finishes. VAD silently starts listening. No "tap to speak" button needed.
5. User says: "I'm Duy."
6. VAD detects speech onset → recording starts. Words appear on screen
word-by-word as the user speaks (live transcript bubble).
7. User stops talking. VAD detects 400ms of silence → recording stops.
8. AI replies: "Nice to meet you, Duy. So tell me — do you work, or are
you studying at the moment?"
9. User answers. Same flow.
10. Goto step 7 until session ends or user navigates away.
The user never touches the mic button. The mic button is still on screen as a fallback for when VAD doesn't catch the start (which happens in ~5% of sessions, usually noisy environments), but most users go an entire session without using it.
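One way to picture the flow above is as a three-state loop with the mic button as an alternate route into recording. The state names, event names, and the `step` and `run` helpers below are hypothetical, chosen for this sketch. It also assumes that speech during AI playback is ignored, since the post only describes listening after the AI finishes.

```python
from enum import Enum, auto
from typing import Iterable


class State(Enum):
    AI_SPEAKING = auto()   # greeting or reply is playing
    LISTENING = auto()     # VAD armed, waiting for speech onset
    RECORDING = auto()     # user is talking
    ENDED = auto()


# (current state, event) -> next state. Event names are illustrative.
TRANSITIONS = {
    (State.AI_SPEAKING, "ai_done"): State.LISTENING,
    (State.LISTENING, "speech_start"): State.RECORDING,  # VAD caught the onset
    (State.LISTENING, "mic_tap"): State.RECORDING,       # fallback when VAD misses it
    (State.RECORDING, "silence"): State.AI_SPEAKING,     # ~400 ms of silence: submit turn
}


def step(state: State, event: str) -> State:
    """Advance one event; unknown (state, event) pairs leave the state unchanged."""
    if state is State.ENDED or event == "end":
        return State.ENDED
    return TRANSITIONS.get((state, event), state)


def run(events: Iterable[str], state: State = State.AI_SPEAKING) -> State:
    """Fold a sequence of events over the state machine."""
    for event in events:
        state = step(state, event)
    return state
```

For example, `run(["ai_done", "speech_start", "silence"])` walks through one full user turn and lands back in `State.AI_SPEAKING`, ready for the AI's reply.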
I tested it on my own setup for a full week before we shipped. The thing that struck me — and I want to be careful here because subjective UX claims are cheap — is that my brain stopped anticipating the next tap. I just talked. The AI just talked back. The cognitive overhead of "okay now I need to press the button" vanished.
The bonus feature: live transcript bubble
While we were in there rebuilding the flow, we brought back something we'd removed eight weeks ago: the live transcript bubble.
When you speak, your words appear in the chat in real time. Word by word. Browser SpeechRecognition handles the partial transcription; Whisper still does the final accurate one.
We'd disabled this in commit 9ddbf23 because partial transcripts sometimes showed broken words ("um", "uh", fragments). That's true — browser SpeechRecognition is messier than Whisper. But again, we'd over-corrected. The information that your voice is being picked up is worth the occasional ugly word. New users especially need that signal — they need to see the app reacting to their voice, not just trust that something will happen in five seconds.
So the live bubble is back. It shows partial words during recording. When the turn submits, the bubble is replaced by the clean Whisper transcript with errors highlighted.
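The partial-then-final behavior can be sketched as follows. The `Bubble` class and its method names are hypothetical, and the sketch assumes each partial result carries the full text heard so far rather than only the newest word. Error highlighting is left out.

```python
from dataclasses import dataclass


@dataclass
class Bubble:
    """Chat bubble that shows messy partials, then the clean final transcript."""
    text: str = ""
    final: bool = False

    def on_partial(self, partial: str) -> None:
        # Partials overwrite each other while the user is still speaking.
        # Late partials that arrive after the final transcript are ignored.
        if not self.final:
            self.text = partial

    def on_final(self, transcript: str) -> None:
        # The final (Whisper) transcript replaces whatever partial was shown.
        self.text = transcript
        self.final = True
```

Ignoring partials after the final transcript guards against a late browser event overwriting the clean text, which is an assumption about event ordering rather than something the post states.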
The high-EQ rewrite
While we were rebuilding the flow, we also rewrote the AI system prompt. The old prompt was, frankly, a casual-friend persona — lots of "Oh wow!" "Hmm interesting!" "Yeah totally!" Reasonable for some learners, condescending for others.
The new prompt targets a thoughtful, emotionally intelligent adult persona. Specifically:
- No robotic affirmations. No "Great answer!" / "Excellent!" / "Wonderful!" after every turn.
- Active listening. The AI references specifics from what you just said, not generic follow-ups.
- Validation before pivot. When you share something hard, the AI acknowledges it before redirecting.
- No slang. No "lol", "bro", "lowkey", "ngl", "vibes". This is IELTS-band-7-to-8 register — natural educated English, not Gen-Z TikTok English.
- Phrasal verbs and idioms woven in naturally — "come across", "end up", "get on with", "right up my alley", "in the long run" — because that's how natives actually speak.
- Sentence variety — simple, compound, complex sentences mixed; cohesive devices like "however", "that said", "on the other hand" used appropriately.
The opening sequence is now deterministic: Turn 1 asks your name, Turn 2 asks your job/studies, Turn 3 invites a brief intro. After that, free conversation. The deterministic open is there because every learner already knows their own answers to these three questions, so they're easy practice for new learners and they bootstrap rapport for everyone.
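A deterministic open followed by free conversation can be expressed as a lookup on the turn number. The labels and the `turn_plan` function are hypothetical names for this sketch; how the real system prompt encodes the sequence is not described in the post.

```python
# Scripted goals for the first three AI turns, in order (labels are illustrative).
OPENING = ("ask_name", "ask_job_or_studies", "invite_brief_intro")


def turn_plan(turn: int) -> str:
    """Return the goal for a 1-based AI turn number."""
    if turn < 1:
        raise ValueError("turn numbers start at 1")
    if turn <= len(OPENING):
        return OPENING[turn - 1]
    return "free_conversation"
```

Here `turn_plan(2)` returns `"ask_job_or_studies"`, and every turn from the fourth onward returns `"free_conversation"`.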
What this is and isn't
This is not voice-cloning, real-time AI, or a tech moonshot. VAD has existed for decades. Live transcript has existed for years. Whisper has existed since 2022. The thing that's new is combining them in defaults that respect how humans actually have conversations.
This is not "perfect" yet. False-positive ambient triggers still happen ~5–10% of the time. If you sit next to a TV running an English movie, the app will hear it and start recording. There's no good client-side fix for that yet. Future versions will let you tap an explicit "Pause auto-mic" button mid-session if you walk into a noisy room.
This is not a replacement for tap-to-speak. The button is still there as a fallback. We're not zealots. If you prefer manual control in Daily Talk, you can still tap, and in Challenges, Role Play, and Pronunciation modes the tap remains the default UX.
Try it
If you've used SpeakShark Daily Talk before, fire up a new session and you'll notice the difference immediately. If you've never tried it, create a free account and just start talking. The AI will greet you, ask your name, and the conversation flows from there.
If you want the hands-free flow without the practice angle — if you just want to talk to someone in English to stay sharp — Daily Talk is that mode. Pick a teacher you like, start a session, talk for ten minutes about nothing. The point is mouth time, not goals.
We'll see in a few weeks of telemetry whether the auto-VAD default holds. My hypothesis: session length goes up, turn count goes up, and the "I felt nervous tapping the button" anxiety we used to hear in user interviews disappears. I'll write a follow-up if I'm wrong.
Either way — the button is no longer in the way.