Originally published in Japanese on Zenn. This is the English version.
Canonical: https://zenn.dev/uya0526_design/articles/satellite1_amivoice-bff

This is satellite article #1 in my "Read-Aloud Speed Meter dev log" series. For the whole picture, see the main article.
Where This Sits
In the read-aloud speed meter app, this article covers the part that sends browser-recorded audio to the AmiVoice API to get back recognized text plus timestamps. The theme is calling an external API without exposing your API key to the browser; in other words, implementing a BFF (Backend for Frontend).
The main article only touched the highlights, so here I go down to a level you can reproduce yourself. Specifically, four things:
- Why you must not call AmiVoice directly from the browser (why a BFF is needed)
- AmiVoice's synchronous HTTP auth, parameters, and multipart order
- Why the browser's `MediaRecorder` output (WebM/Opus) passes through as-is
- Reshaping the raw JSON with a pure-function mapper and testing it with fixtures
💡 I'm an ex-Java engineer learning TypeScript in public, so I drop in comparisons to Java here and there.
Why a BFF Is Needed
Using the AmiVoice API requires an API key. And that key must never appear in browser-side code. Frontend JavaScript is fully inspectable by the user, so writing the key there leaks it instantly.
So I insert a relay that holds the key.
```text
[Browser] ──audio Blob──▶ [Next.js API Route (BFF / holds key)] ──▶ [AmiVoice API]
 record / display           audio field, reshape into u / d / a      speech → text
```
The browser only calls my own API Route (/api/recognize), and that Route attaches the key server-side and forwards to AmiVoice. The key is just read from process.env and never ends up in the bundle shipped to the browser.
☕ Java comparison: This is the same as a Spring `@RestController` reading an external API key from `application.yml` (env vars) and relaying without showing it to the client. Think "a thin Servlet that hides the secret and relays an external API." In Next.js, the `export async function POST` in `app/api/recognize/route.ts` corresponds to `@PostMapping`.
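For the browser half of this relay, a minimal sketch might look like the following. It assumes the recording is already available as a `Blob`; the helper names `buildRecognizeForm` and `requestRecognition` are hypothetical and not taken from the app. The point is only that the browser talks to `/api/recognize` and never sees the AmiVoice key.

```typescript
// Hypothetical browser-side helpers (names are illustrative, not from the app).

/** Wrap the recorded Blob in the "audio" field that the Route reads. */
export function buildRecognizeForm(recording: Blob): FormData {
  const form = new FormData();
  form.append("audio", recording, "recording.webm");
  return form;
}

/** POST the recording to our own Route; no API key is involved on this side. */
export async function requestRecognition(recording: Blob): Promise<unknown> {
  const res = await fetch("/api/recognize", {
    method: "POST",
    // No Content-Type header: fetch sets the multipart boundary itself.
    body: buildRecognizeForm(recording),
  });
  if (!res.ok) {
    throw new Error(`Recognition request failed with HTTP ${res.status}`);
  }
  return res.json();
}
```

Splitting the form construction from the network call keeps the first function easy to check without a server.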
AmiVoice Synchronous HTTP API Spec
I used the synchronous HTTP interface, implementing against the official manual and double-checking with curl. The key points:
| Item | Detail |
|---|---|
| Endpoint | `POST https://acp-api.amivoice.com/v1/nolog/recognize` (no-log version) |
| Auth | Put the API key in the multipart `u` part (not an `Authorization` header) |
| Engine | Engine name in `d` (e.g. `-a-general` = general conversational) |
| Audio | Binary in `a`. Must be the final multipart part |
| Response | `text` at the top; per-word `starttime`/`endtime` in `results[].tokens[]` |
| Error check | `code` empty string `""` = success; `code !== ""` = failure |
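The error-check rule in the table can be sketched as a small predicate. This is a sketch under the assumption that only the `code` field is inspected; `isRecognitionSuccess` is a hypothetical name, and the article's own code may handle failures differently.

```typescript
// Hypothetical helper: treats a response as successful only when `code` is
// exactly the empty string, as described in the table above.
export function isRecognitionSuccess(raw: unknown): boolean {
  if (typeof raw !== "object" || raw === null) return false;
  const code = (raw as { code?: unknown }).code;
  return code === "";
}
```

A missing `code` is treated as a failure here, which is a deliberately strict reading of the rule rather than something the AmiVoice manual is quoted as requiring.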
I hit three traps here. Let me share them in order.
Trap ①: Auth is the u field, not a header
When I think "REST API auth," I think Authorization: Bearer .... AmiVoice's synchronous HTTP is different: the API key goes in a multipart field called u. My first version put it in a header, got a 401-style failure, and I stalled here.
Trap ②: The audio a goes last
The multipart parts need to be in the order u → d → a (audio), with the audio as the final part. Adding another field after it caused the audio to be ignored. FormData keeps entries in the order you append them, so the order of your append calls is the order AmiVoice sees.
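One way to make the ordering rule hard to break is to build the form in a single function and check the field order in a test. The sketch below assumes the standard `FormData` behavior of preserving insertion order; `buildAmiVoiceForm` and `fieldOrder` are hypothetical names, while the field names `u`, `d`, and `a` come from the text above.

```typescript
// Hypothetical helper: builds the outbound form with the audio part last.
export function buildAmiVoiceForm(apiKey: string, engine: string, audio: Blob): FormData {
  const form = new FormData();
  form.append("u", apiKey); // auth
  form.append("d", engine); // engine name
  form.append("a", audio, "recording.webm"); // audio: appended last on purpose
  return form;
}

/** Field names in the order they were appended. */
export function fieldOrder(form: FormData): string[] {
  const names: string[] = [];
  form.forEach((_value, name) => {
    names.push(name);
  });
  return names;
}
```

With this in place, a test can assert that `fieldOrder(buildAmiVoiceForm(...))` is `u`, `d`, `a`, so a later edit that appends a field after the audio fails loudly.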
Trap ③: WebM/Opus passes through as-is (c can be omitted)
The browser's MediaRecorder usually outputs audio/webm;codecs=opus. I braced for "I'll probably need to convert the format before handing it to AmiVoice," but WebM + Opus carries a header in the container, so the audio-format parameter c can be omitted on synchronous HTTP (verified with curl). You can send the recording Blob unconverted.
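Since the pass-through relies on the recording actually being WebM/Opus, it may be worth asking for that format explicitly rather than trusting the browser default. The following is only a sketch: `pickRecordingMimeType` and the candidate list are hypothetical, and the support check is injected so the function stays pure. Whether formats other than WebM/Opus also work without `c` is not something this article verified.

```typescript
// Hypothetical helper: returns the first candidate the recorder supports.
const CANDIDATE_MIME_TYPES = ["audio/webm;codecs=opus", "audio/webm"];

export function pickRecordingMimeType(
  isSupported: (mimeType: string) => boolean,
): string | undefined {
  return CANDIDATE_MIME_TYPES.find((mimeType) => isSupported(mimeType));
}

// In the browser (not executed here):
//   const mimeType = pickRecordingMimeType((t) => MediaRecorder.isTypeSupported(t));
//   const recorder = new MediaRecorder(stream, mimeType ? { mimeType } : undefined);
```

If nothing matches, the function returns `undefined` and the recorder falls back to the browser default, at which point the no-conversion assumption should be rechecked.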
Implementing the API Route (BFF)
Here's app/api/recognize/route.ts with the traps accounted for.
```typescript
import { NextResponse } from "next/server";

const AMIVOICE_ENDPOINT = "https://acp-api.amivoice.com/v1/nolog/recognize";
const AMIVOICE_ENGINE = "-a-general"; // general conversational

export async function POST(req: Request) {
  // 1. Receive the audio Blob from the browser under the "audio" field
  const inForm = await req.formData();
  const audio = inForm.get("audio");
  if (!(audio instanceof Blob)) {
    // A missing or non-file field would otherwise fail inside append() below
    return NextResponse.json({ error: "audio field is required" }, { status: 400 });
  }

  // 2. Reshape into u / d / a order for AmiVoice
  const outForm = new FormData();
  outForm.append("u", process.env.AMIVOICE_API_KEY ?? ""); // auth (not a header)
  outForm.append("d", AMIVOICE_ENGINE); // engine
  outForm.append("a", audio, "recording.webm"); // audio (always last)

  // 3. Forward to synchronous HTTP and return the raw JSON as-is
  const res = await fetch(AMIVOICE_ENDPOINT, { method: "POST", body: outForm });
  const body = await res.text();
  return new NextResponse(body, {
    status: res.status,
    headers: { "Content-Type": res.headers.get("Content-Type") ?? "application/json" },
  });
}
```
(Simplified. In the real code, a missing AMIVOICE_API_KEY returns 500 and never sends an empty key to AmiVoice.)
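The real guard is not shown in this article, so the following is only a sketch of what such a check could look like. The function name `missingKeyResponse` and the error body are assumptions; it uses the standard `Response` class so it runs without Next.js.

```typescript
// Hypothetical guard: returns a 500 response when the key is absent or blank,
// or null when the key is usable and the handler can continue.
export function missingKeyResponse(apiKey: string | undefined): Response | null {
  if (apiKey !== undefined && apiKey.trim() !== "") return null;
  return new Response(JSON.stringify({ error: "AMIVOICE_API_KEY is not set" }), {
    status: 500,
    headers: { "Content-Type": "application/json" },
  });
}

// Inside POST, before building the outbound form (not executed here):
//   const failure = missingKeyResponse(process.env.AMIVOICE_API_KEY);
//   if (failure) return failure;
```

Failing before the outbound `fetch` means a misconfigured deployment never sends an empty `u` value to AmiVoice.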
The key idea is "two stages of FormData." Browser → my Route receives under the audio field; Route → AmiVoice reassembles into u/d/a.
☕ Java comparison: It's the same shape as reshaping an inbound DTO into the multipart form for an outbound external API. "The shape you receive and the shape you send are different things" maps directly onto the FormData reshape.
I designed this Route to return AmiVoice's raw JSON almost untouched. Rather than putting formatting in the Route, I leave it to the next mapper (a pure function), separating responsibilities.
Look at the raw JSON with curl first
Before implementing, hitting the endpoint with curl to see the shape of the raw response turned out to be the fastest route in the end.
```shell
curl -X POST https://acp-api.amivoice.com/v1/nolog/recognize \
  -F u="$AMIVOICE_API_KEY" \
  -F d="-a-general" \
  -F a=@recording.webm
```
I pass the API key from an environment variable ($AMIVOICE_API_KEY) so I never hardcode a raw key in command history or code. "Keep the key out of the code" applies from the curl-verification stage onward.
The returned JSON looks roughly like this (excerpted; values are examples):
```json
{
  "results": [
    {
      "tokens": [
        { "written": "一番", "spoken": "いちばん", "starttime": 1080, "endtime": 1480 },
        { "written": "買っ", "spoken": "かっ", "starttime": 1480, "endtime": 1672 },
        { "written": "た", "spoken": "た", "starttime": 1720, "endtime": 1800 }
      ],
      "text": "一番買った"
    }
  ],
  "text": "一番買った",
  "code": ""
}
```
The Mapper: Raw JSON β App Type (a Pure Function)
I reshape the raw JSON into the AmiVoiceResponse type used inside the app. I make this conversion a pure function with no I/O, sitting as a layer outside the measurement logic (calculateMetrics).
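```typescript
export interface AmiVoiceResponse {
  text: string;
  segments: { starttime: number; endtime: number }[];
}

// Only the parts of the raw AmiVoice JSON that this mapper reads.
interface RawAmiVoiceResponse {
  text?: string;
  results?: { tokens?: { starttime?: number; endtime?: number }[] }[];
}

export function mapAmiVoiceResponse(rawResponse: unknown): AmiVoiceResponse {
  if (typeof rawResponse !== "object" || rawResponse === null) {
    throw new Error("Invalid response");
  }
  const { text, results } = rawResponse as RawAmiVoiceResponse;
  // Per-word segments live in results[0].tokens; fall back to an empty list
  // when results or tokens are missing instead of throwing a TypeError.
  const segments = (results?.[0]?.tokens ?? []).map((token) => ({
    starttime: token.starttime ?? 0,
    endtime: token.endtime ?? 0,
  }));
  return { text: text ?? "", segments };
}
```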
There was one fix here too. My first version referenced raw.result.tokens (singular) and got nothing back; I corrected it to raw.results[0].tokens (a plural array). It's a mistake I could have prevented by looking at the raw JSON with curl first, and it reaffirmed how important it is to "get hold of real data early."
🧠 Design decision: You can build segments "per word" or "per utterance." I went with per word, because the gaps (silence) between words then feed into the stagnation rate, letting me measure fluency more granularly.
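To make the per-word choice concrete, here is a sketch of how gaps between consecutive words can be read off the segments. It is not the app's `calculateMetrics`, which this article does not show; `silentGaps` is a hypothetical helper, and how the gaps turn into a stagnation rate is left out. The sample values are the three tokens from the JSON excerpt above.

```typescript
interface Segment {
  starttime: number;
  endtime: number;
}

// Hypothetical helper: gap between the end of one word and the start of the
// next, in the same unit as the timestamps. Overlaps are clamped to 0.
export function silentGaps(segments: Segment[]): number[] {
  const gaps: number[] = [];
  let previous: Segment | undefined;
  for (const current of segments) {
    if (previous !== undefined) {
      gaps.push(Math.max(0, current.starttime - previous.endtime));
    }
    previous = current;
  }
  return gaps;
}

const sampleSegments: Segment[] = [
  { starttime: 1080, endtime: 1480 },
  { starttime: 1480, endtime: 1672 },
  { starttime: 1720, endtime: 1800 },
];
// silentGaps(sampleSegments) returns [0, 48]
```

With per-utterance segments the same input would collapse into a single span and both gaps would disappear, which is the granularity the design decision is about.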
Testing With Fixtures
Since the mapper is a pure function, it's straightforward to test with Vitest. The key is to save the real data captured with curl as a fixture.
```text
fixtures/
├── test_01.json   # short response (3 tokens)
└── test_02.json   # based on real curl data (9 tokens)
```
```typescript
import { describe, test, expect } from "vitest";
import { mapAmiVoiceResponse } from "./mapAmiVoiceResponse";
import raw01 from "../../fixtures/test_01.json";

describe("mapAmiVoiceResponse", () => {
  test("builds segments from tokens and extracts text", () => {
    const result = mapAmiVoiceResponse(raw01);
    expect(result.segments).toHaveLength(3);
    // Not just "the shape matches": pin the concrete values
    expect(result.segments[0].starttime).toBe(1080);
    expect(result.segments[0].endtime).toBe(1480);
    expect(result.text).toBe("一番買った");
  });
});
```
What I was conscious of here is a lesson from a past project: "a passing test ≠ behaving as intended." If you only compare the result of .map() against another .map(), you end up writing the test with the same logic as the mapper, which doesn't verify intent. So by hardcoding concrete values like starttime: 1080 and asserting on them, I locked down "is it really pulling the correct values?"
☕ Java comparison: Fixtures are like JUnit test resources (JSON under `src/test/resources`); `toHaveLength` / concrete-value asserts correspond to `assertEquals`. The idea of "pinning production-like raw JSON into your tests" carried over directly.
What I Implemented Myself / What I Asked AI For
This is AI-collaborative development, so for transparency I write out the split.
| Area | Detail |
|---|---|
| My decisions / implementation | Adopting the BFF structure, the u/d/a order, the raw-JSON-return policy, choosing per-word segments, concrete-value asserts in fixtures, checking the official manual, curl verification |
| Asked AI for | The POST handler skeleton, fetch / FormData boilerplate, a first-draft mapper example, examples of how to write the tests |
| Fixed on AI's pointer | `result` → `results[0]`, `Authorization` header → `u` field, a weak `.map()`-only test → concrete-value asserts |
AI provided boilerplate examples, but the rewrites to match the official spec, the verification, and the fixes were all mine.
Wrapping Up
This was my record of calling AmiVoice's synchronous HTTP API through a Next.js BFF. The takeaways:
- Keep the API key out of the browser. Use a BFF that relays through Next.js API Routes.
- For synchronous HTTP, auth is the `u` field, the audio `a` is the final part, and WebM/Opus lets you omit `c`.
- Reshape the raw JSON with a pure-function mapper, and test it with a fixture of real curl data + concrete-value asserts.
Most of my stumbles were ones I could have prevented by looking at the raw response first. When working with an external API, hit it once with curl and look at the raw JSON with your own eyes before implementing. That order turned out to be the fastest.
The detailed development log is in the repository's LEARNING_LOG_Phase1_Step3.md.
Next time I cover the "two-stage" design that hands the recognized result (settled facts) to Claude Haiku to generate one-line feedback: satellite #2, "Claude Haiku coaching design and the prompt swamp" (https://dev.to/uya0526design/dont-let-claude-haiku-do-the-math-a-two-stage-read-aloud-coach-design-and-the-prompt-swamp-2ihc).
This article is part of my public learning journey using AI tools (Claude / Cursor). The design, tech selection, and implementation decisions are mine, and the code is verified with Vitest. I collaborate with AI on the article's structure, outline, and draft prose, and I review and revise every line before publishing.