Calling AmiVoice's Synchronous HTTP API Through a Next.js BFF — Auth, multipart Order, and the WebM Trap

📝 Originally published in Japanese on Zenn. This is the English version.
Canonical: https://zenn.dev/uya0526_design/articles/satellite1_amivoice-bff

📚 This is satellite article #1 in my "Read-Aloud Speed Meter dev log" series. For the whole picture, see the main article.

Where This Sits

In the read-aloud speed meter app, this article covers the part that sends browser-recorded audio to the AmiVoice API to get back recognized text plus timestamps. The theme is calling an external API without exposing your API key to the browser — in other words, implementing a BFF (Backend for Frontend).

The main article only touched the highlights, so here I go down to a level you can reproduce yourself. Specifically, four things:

Why you must not call AmiVoice directly from the browser (why a BFF is needed)
AmiVoice's synchronous HTTP auth, parameters, and multipart order
Why the browser's MediaRecorder output (WebM/Opus) passes through as-is
Reshaping the raw JSON with a pure-function mapper and testing it with fixtures

💡 I'm an ex-Java engineer learning TypeScript in public, so I drop in comparisons to Java here and there.

Why a BFF Is Needed

Using the AmiVoice API requires an API key. And that key must never appear in browser-side code. Frontend JavaScript is fully inspectable by the user, so writing the key there leaks it instantly.

So I insert a relay that holds the key.

[Browser] ──audio Blob──▶ [Next.js API Route (BFF / holds key)] ──▶ [AmiVoice API]
 record / display   audio field      reshape into u / d / a        speech → text

The browser only calls my own API Route (/api/recognize), and that Route attaches the key server-side and forwards to AmiVoice. The key is just read from process.env and never ends up in the bundle shipped to the browser.
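
For reference, the browser half of this relay might look like the sketch below. It assumes the Route is mounted at /api/recognize and reads the recording from an "audio" field, as described above. The helper names buildRecognizeForm and requestRecognition are hypothetical and do not come from the app.

```typescript
// Hypothetical browser-side helpers; the names are illustrative only.

/** Wrap the recorded Blob in the single "audio" field the Route expects. */
function buildRecognizeForm(recording: Blob): FormData {
  const form = new FormData();
  form.append("audio", recording, "recording.webm");
  return form;
}

/** POST the recording to our own Route. No API key appears anywhere here. */
async function requestRecognition(recording: Blob): Promise<unknown> {
  const res = await fetch("/api/recognize", {
    method: "POST",
    // fetch sets the multipart Content-Type and boundary for a FormData body
    body: buildRecognizeForm(recording),
  });
  if (!res.ok) {
    throw new Error(`Recognition request failed with HTTP ${res.status}`);
  }
  return res.json();
}
```

The only thing the browser knows is the path of my own Route, so inspecting the shipped JavaScript reveals nothing about the AmiVoice credentials.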

☕ Java comparison: This is the same as a Spring @RestController reading an external API key from application.yml (env vars) and relaying without showing it to the client. Think "a thin Servlet that hides the secret and relays an external API." In Next.js, the export async function POST in app/api/recognize/route.ts corresponds to @PostMapping.

AmiVoice Synchronous HTTP API Spec

I used the synchronous HTTP interface, implementing against the official manual and double-checking with curl. The key points:

Item	Detail
Endpoint	`POST https://acp-api.amivoice.com/v1/nolog/recognize` (no-log version)
Auth	Put the API key in the multipart `u` part (not an `Authorization` header)
Engine	Engine name in `d` (e.g. `-a-general` = general conversational)
Audio	Binary in `a`. Must be the final multipart part
Response	`text` at the top; per-word `starttime`/`endtime` in `results[].tokens[]`
Error check	`code` empty string `""` = success; `code !== ""` = failure

I hit three traps here. Let me share them in order.

Trap ① — Auth is the `u` field, not a header

When I think "REST API auth," I think Authorization: Bearer .... AmiVoice's synchronous HTTP is different: the API key goes in a multipart field called u. My first version put it in a header, got a 401-style failure, and I stalled here.

Trap ② — The audio `a` goes last

The multipart parts need to be in the order u → d → a (audio), with the audio as the final part. Adding another field after it caused the audio to be ignored. The order you append to FormData carries meaning directly.
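
Because the append order is the wire order, it can help to isolate it in one small function and check it. The following is a minimal sketch; buildAmiVoiceForm and fieldOrder are hypothetical helper names, and "dummy-key" in the usage note is a placeholder.

```typescript
// Hypothetical helper: assembles the outbound form in the u -> d -> a order.
function buildAmiVoiceForm(apiKey: string, engine: string, audio: Blob): FormData {
  const form = new FormData();
  form.append("u", apiKey); // 1. auth
  form.append("d", engine); // 2. engine name
  form.append("a", audio, "recording.webm"); // 3. audio, appended last
  return form;
}

/** List the field names in the order they were appended. */
function fieldOrder(form: FormData): string[] {
  const names: string[] = [];
  form.forEach((_value, key) => {
    names.push(key);
  });
  return names;
}
```

Calling fieldOrder(buildAmiVoiceForm("dummy-key", "-a-general", blob)) yields ["u", "d", "a"], which makes the ordering rule something a unit test can pin down instead of a convention to remember.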

Trap ③ — WebM/Opus passes through as-is (`c` can be omitted)

The browser's MediaRecorder usually outputs audio/webm;codecs=opus. I expected to have to convert the format before handing it to AmiVoice, but the WebM container carries its own header describing the Opus stream, so the audio-format parameter c can be omitted on synchronous HTTP (verified with curl). You can send the recording Blob unconverted.
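
Since "usually" is not "always", one option is to ask the browser explicitly for WebM/Opus before recording. The sketch below is an assumption on my part and not code from the app: pickRecordingMimeType and PREFERRED_MIME_TYPES are hypothetical names, and only the first candidate matches the format verified with curl.

```typescript
// Candidate formats in order of preference. Only the first is the format
// verified against AmiVoice; the plain "audio/webm" fallback is an assumption.
const PREFERRED_MIME_TYPES = ["audio/webm;codecs=opus", "audio/webm"];

/**
 * Return the first preferred MIME type the environment supports, or
 * undefined when none is supported. The support check is injected so the
 * function stays pure and can be tested outside a browser.
 */
function pickRecordingMimeType(
  isSupported: (mimeType: string) => boolean,
): string | undefined {
  return PREFERRED_MIME_TYPES.find((mimeType) => isSupported(mimeType));
}
```

In the browser, the check can be supplied as (m) => MediaRecorder.isTypeSupported(m), and the result passed as the mimeType option of the MediaRecorder constructor. If it returns undefined, the browser records in some other format, and the "omit c" shortcut has not been verified for that case.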

Implementing the API Route (BFF)

Here's app/api/recognize/route.ts with the traps accounted for.

import { NextResponse } from "next/server";

const AMIVOICE_ENDPOINT = "https://acp-api.amivoice.com/v1/nolog/recognize";
const AMIVOICE_ENGINE = "-a-general"; // general conversational

export async function POST(req: Request) {
  // 1. Receive the audio Blob from the browser under the "audio" field
  const inForm = await req.formData();
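  const audio = inForm.get("audio");
  if (!(audio instanceof Blob)) {
    // get() returns null or a string when the field is missing or not a file
    return NextResponse.json({ error: "audio field is required" }, { status: 400 });
  }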

  // 2. Reshape into u / d / a order for AmiVoice
  const outForm = new FormData();
  outForm.append("u", process.env.AMIVOICE_API_KEY ?? ""); // auth (not a header)
  outForm.append("d", AMIVOICE_ENGINE);                    // engine
  outForm.append("a", audio, "recording.webm");            // audio (always last)

  // 3. Forward to synchronous HTTP and return the raw JSON as-is
  const res = await fetch(AMIVOICE_ENDPOINT, { method: "POST", body: outForm });
  const body = await res.text();
  return new NextResponse(body, {
    status: res.status,
    headers: { "Content-Type": res.headers.get("Content-Type") ?? "application/json" },
  });
}

(Simplified. In the real code, a missing AMIVOICE_API_KEY returns 500 and never sends an empty key to AmiVoice.)
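
A guard of that kind could be sketched as follows. This is not the repository's code: requireApiKey and KeyLookup are hypothetical names, and the error body is a placeholder. The point it illustrates is failing before any outbound request is built.

```typescript
// Hypothetical guard; the real implementation in the repository may differ.
type KeyLookup =
  | { ok: true; apiKey: string }
  | { ok: false; response: Response };

function requireApiKey(env: Record<string, string | undefined>): KeyLookup {
  const apiKey = env["AMIVOICE_API_KEY"];
  if (!apiKey) {
    // Covers both "unset" and "empty string": an empty key is never sent on.
    const response = new Response(
      JSON.stringify({ error: "Server is missing its AmiVoice configuration" }),
      { status: 500, headers: { "Content-Type": "application/json" } },
    );
    return { ok: false, response };
  }
  return { ok: true, apiKey };
}
```

Inside the handler this would be used as const lookup = requireApiKey(process.env); followed by if (!lookup.ok) return lookup.response;, after which lookup.apiKey is a plain string and the ?? "" fallback is no longer needed.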

The key idea is "two stages of FormData." Browser → my Route receives under the audio field; Route → AmiVoice reassembles into u/d/a.

☕ Java comparison: It's the same shape as reshaping an inbound DTO into the multipart form for an outbound external API. "The shape you receive and the shape you send are different things" maps directly onto the FormData reshape.

I designed this Route to return AmiVoice's raw JSON almost untouched. Rather than putting formatting in the Route, I leave it to the next mapper (a pure function), separating responsibilities.

Look at the raw JSON with curl first

Hitting the endpoint with curl before implementing, just to see the shape of the raw response, turned out to be the fastest route.

curl -X POST https://acp-api.amivoice.com/v1/nolog/recognize \
  -F u="$AMIVOICE_API_KEY" \
  -F d="-a-general" \
  -F a=@recording.webm

I pass the API key from an environment variable ($AMIVOICE_API_KEY) so I never hardcode a raw key in command history or code. "Keep the key out of the code" applies from the curl-verification stage onward.

The returned JSON looks roughly like this (excerpted; values are examples):

{"results":[{"tokens":[{"written":"一番","spoken":"いちばん","starttime":1080,"endtime":1480},{"written":"買っ","spoken":"かっ","starttime":1480,"endtime":1672},{"written":"た","spoken":"た","starttime":1720,"endtime":1800}],"text":"一番買った"}],"text":"一番買った","code":""}
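
Written down as types, the excerpt above might look like the sketch below, together with the success check from the spec table (code equal to the empty string). The type names are hypothetical, and they cover only the fields visible in the excerpt. The real response carries more fields that this sketch ignores.

```typescript
// Types for the excerpted fields only; names are illustrative.
interface RawToken {
  written: string;
  spoken: string;
  starttime: number;
  endtime: number;
}

interface RawRecognition {
  results: { tokens: RawToken[]; text: string }[];
  text: string;
  code: string; // "" means success, anything else means failure
}

/** True when recognition succeeded, i.e. `code` is the empty string. */
function isRecognitionSuccess(raw: Pick<RawRecognition, "code">): boolean {
  return raw.code === "";
}
```

Checking code before mapping keeps a failed recognition from being silently treated as "zero tokens", which would otherwise look like a valid but empty reading.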

The Mapper: Raw JSON → App Type (a Pure Function)

I reshape the raw JSON into the AmiVoiceResponse type used inside the app. I make this conversion a pure function with no I/O, sitting as a layer outside the measurement logic (calculateMetrics).

interface AmiVoiceResponse {
  text: string;
  segments: { starttime: number; endtime: number }[];
}

export function mapAmiVoiceResponse(rawResponse: unknown): AmiVoiceResponse {
  if (typeof rawResponse !== "object" || rawResponse === null) {
    throw new Error("Invalid response");
  }
  const { text, results } = rawResponse as {
    text?: string;
    results?: { tokens?: { starttime?: number; endtime?: number }[] }[];
  };
  // Optional chaining all the way down: a failed response may carry no results
  const segments = results?.[0]?.tokens?.map((token) => ({
    starttime: token.starttime ?? 0,
    endtime: token.endtime ?? 0,
  })) ?? [];
  return { text: text ?? "", segments };
}

There was one fix here too. My first version referenced raw.result.tokens (singular) and got nothing back; I corrected it to raw.results[0].tokens (a plural array). It's a mistake I could have prevented by looking at the raw JSON with curl first, and it reaffirmed how important it is to "get hold of real data early."

🧭 Design decision: You can build segments "per word" or "per utterance." I went with per word, because the gaps (silence) between words then feed into the stagnation rate, letting me measure fluency more granularly.
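
To make the "gaps between words" idea concrete, here is a minimal sketch that derives the silences from per-word segments. It is only an illustration: silenceGaps is a hypothetical helper, and how calculateMetrics actually turns gaps into a stagnation rate is not shown in this article.

```typescript
interface Segment {
  starttime: number;
  endtime: number;
}

/** Silence between consecutive segments, in the same unit as the timestamps. */
function silenceGaps(segments: Segment[]): number[] {
  const gaps: number[] = [];
  let previous: Segment | undefined;
  for (const current of segments) {
    if (previous !== undefined) {
      // Clamp at 0 in case two tokens touch or overlap.
      gaps.push(Math.max(0, current.starttime - previous.endtime));
    }
    previous = current;
  }
  return gaps;
}
```

For the three tokens in the curl excerpt (1080 to 1480, 1480 to 1672, 1720 to 1800) this returns [0, 48]: the first two words touch, and there is a gap of 48 before the last one. With per-utterance segments, both of those values would be invisible.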

Testing With Fixtures

Since the mapper is a pure function, it's straightforward to test with Vitest. The key is to save the real data captured with curl as a fixture.

fixtures/
├── test_01.json   # short response (3 tokens)
└── test_02.json   # based on real curl data (9 tokens)

import { describe, test, expect } from "vitest";
import { mapAmiVoiceResponse } from "./mapAmiVoiceResponse";
import raw01 from "../../fixtures/test_01.json";

describe("mapAmiVoiceResponse", () => {
  test("builds segments from tokens and extracts text", () => {
    const result = mapAmiVoiceResponse(raw01);
    expect(result.segments).toHaveLength(3);
    // Not just "the shape matches" — pin the concrete values
    expect(result.segments[0].starttime).toBe(1080);
    expect(result.segments[0].endtime).toBe(1480);
    expect(result.text).toBe("一番買った");
  });
});

What I kept in mind here is a lesson from a past project: "a passing test ≠ behaving as intended." If you only compare the result of .map() against another .map(), the test repeats the mapper's own logic and verifies nothing about intent. So I hardcode concrete values like starttime: 1080 and assert on them, which pins down whether the mapper really pulls the correct values.

☕ Java comparison: Fixtures are like JUnit test resources (JSON under src/test/resources); toHaveLength / concrete-value asserts correspond to assertEquals. The idea of "pinning production-like raw JSON into your tests" carried over directly.

What I Implemented Myself / What I Asked AI For

This is AI-collaborative development, so for transparency I write out the split.

Area	Detail
My decisions / implementation	Adopting the BFF structure, the `u`/`d`/`a` order, the raw-JSON-return policy, choosing per-word segments, concrete-value asserts in fixtures, checking the official manual, curl verification
Asked AI for	The POST handler skeleton, `fetch` / `FormData` boilerplate, a first-draft mapper example, examples of how to write the tests
Fixed after AI pointed it out	`result` → `results[0]`, `Authorization` header → `u` field, a weak `.map()`-only test → concrete-value asserts

AI provided boilerplate examples, but the rewrites to match the official spec, the verification, and the fixes were all mine.

Wrapping Up

This was my record of calling AmiVoice's synchronous HTTP API through a Next.js BFF. The takeaways:

Keep the API key out of the browser. Use a BFF that relays through Next.js API Routes.
For synchronous HTTP, auth is the u field, the audio a is the final part, and WebM/Opus lets you omit c.
Reshape the raw JSON with a pure-function mapper, and test it with a fixture of real curl data + concrete-value asserts.

Most of my stumbles were ones I could have prevented by looking at the raw response first. When working with an external API, hit it once with curl and look at the raw JSON with your own eyes before implementing — that order turned out to be the fastest.

The detailed development log is in the repository's LEARNING_LOG_Phase1_Step3.md.

📦 https://github.com/uya0526-design/reading-speed-meter

Next time I cover the "two-stage" design that hands the recognized result (settled facts) to Claude Haiku to generate one-line feedback. That is satellite #2, "Claude Haiku coaching design and the prompt swamp" (https://dev.to/uya0526design/dont-let-claude-haiku-do-the-math-a-two-stage-read-aloud-coach-design-and-the-prompt-swamp-2ihc).

This article is part of my public learning journey using AI tools (Claude / Cursor). The design, tech selection, and implementation decisions are mine, and the code is verified with Vitest. I collaborate with AI on the article's structure, outline, and draft prose, and I review and revise every line before publishing.