Text & Audio API: Translation and transcription service

A content-agnostic translation and transcription service. Submit text or audio in your input language, specify a target language and an output format (text or audio), and, if the language pair is supported, receive the translated result back in that format. The API does not interpret content or assume a domain; what you do with it is up to you.

Endpoints: POST /api/v1/text · POST /api/v1/audio
Key audience: Text: user or server · Audio/status: server

What is the Text & Audio API?

The Text & Audio API is a stateless, content-agnostic translation and transcription service. You supply the input (text or audio), the input language, the target language, and the output format you want back (text or audio). If the language pair is supported, the API returns the translated result. The service does not interpret what you send: there is no agricultural model, no follow-up logic, and no memory between calls.

Two flows are supported: text → text or audio (submit a string, receive the translation as text or synthesised speech), and audio → text or audio (submit an audio file, receive a transcript, optionally translated and synthesised back to speech).

Local dialect coverage is the focus

The service is built to expand translation and speech support to African and Bantu languages that mainstream tools cover poorly. Production validation on May 13, 2026 returned 200 for synchronous text translation with a user key, 202 for async text job creation with a user key, and 202 for audio upload with a server key. Job polling and locale listing are server-audience routes in the current policy.

Authentication and scopes

All endpoints require a valid API key. Use user keys only for creating text translation requests. Use a server key for audio uploads, locale discovery, and job polling.

X-Api-Key: YOUR_API_KEY

# Bearer style is also accepted
Authorization: Bearer YOUR_API_KEY
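
As a rough illustration of the key rules above, the sketch below builds either header style and reports which key audience a route accepts. The function names (auth_headers, allowed_audiences) are invented for this example, and the route check is only a reading of the policy described here, not something the API exposes.

```python
# Hypothetical helpers; the names are illustrative and not part of the API.

def auth_headers(api_key: str, bearer: bool = False) -> dict:
    """Return request headers for either accepted key style."""
    if not api_key:
        raise ValueError("api_key must not be empty")
    if bearer:
        return {"Authorization": f"Bearer {api_key}"}
    return {"X-Api-Key": api_key}


def allowed_audiences(method: str, path: str) -> set:
    """Key audiences a route accepts under the policy described above."""
    creates_text = method.upper() == "POST" and path.rstrip("/") in (
        "/api/v1/text",
        "/api/v1/text/jobs",
    )
    # Only text creation accepts user keys; audio upload, locale listing
    # and job polling are treated as server-audience routes.
    return {"user", "server"} if creates_text else {"server"}
```

A check like this can run before a request is sent, so that a user key is never attached to a polling or upload call by mistake.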

Text and audio operations use separate scopes, allowing you to grant minimal permissions to each integration:

Text scope, field:text

  • POST /api/v1/text, Synchronous text translation
  • POST /api/v1/text/jobs, Create async text job
  • GET /api/v1/text/jobs/job_abc123, Check job status with a server key

Audio scope, field:audio

  • POST /api/v1/audio, Upload audio file for transcription/translation
  • GET /api/v1/audio/audio_job_abc123, Check audio job status
  • GET /api/v1/locales, List supported languages

Billing

POST /api/v1/text and POST /api/v1/text/jobs are billed per call under the field_text billing endpoint: 1 credit, or 2 when you request spoken output. POST /api/v1/audio is billed by length: it reserves the worst-case credits when you submit, then settles the exact cost and refunds the difference once the real duration is measured (base 1 + 1 per 30s in + 1 per 30s spoken out). GET endpoints for status polling and locale listing are not billed.

Worked example: a 45-second clip translated to spoken audio with about 20 seconds of speech settles to 4 credits (base 1, plus 2 for the input audio at 1 per 30s, plus 1 for the spoken output). On submit the worst case is reserved; the unused hold is refunded once the real duration is measured.
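
The arithmetic above can be sketched as a small estimator. This is an illustration rather than the billing implementation: the function names are hypothetical, and rounding each started 30-second block up to a full credit is an assumption inferred from the worked example (45 seconds of input counted as 2).

```python
import math


def text_credits(output: str) -> int:
    """Credits for one text call: 1, or 2 when spoken output is requested."""
    if output not in ("text", "audio"):
        raise ValueError('output must be "text" or "audio"')
    return 2 if output == "audio" else 1


def audio_credits(input_seconds: float, spoken_seconds: float = 0.0) -> int:
    """Settled credits for one audio job: base 1 + input blocks + output blocks."""
    if input_seconds < 0 or spoken_seconds < 0:
        raise ValueError("durations must not be negative")
    # Assumption: every started 30 s block costs one credit (round up).
    blocks_in = math.ceil(input_seconds / 30)
    blocks_out = math.ceil(spoken_seconds / 30)
    return 1 + blocks_in + blocks_out
```

With these assumptions, audio_credits(45, 20) gives the 4 credits of the worked example: 1 base, 2 for the input, 1 for the spoken output.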

Synchronous text translation

POST /api/v1/text translates a text string and optionally synthesises it to audio in a single synchronous call. Use this for short strings where low latency matters.

Required fields

  • text (string), The input text to translate
  • lang (string), Target language code (e.g. sw for Swahili, zh for Mandarin). See the supported-languages section below for the full list.
  • output (string), Either "text" or "audio"

Optional fields

  • source_lang (string, default "auto"), Source language; set to "auto" for automatic detection

Example, text output

curl -X POST "https://api.fildraai.com/api/v1/text" \
  -H "X-Api-Key: YOUR_USER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, how are you today?",
    "lang": "sw",
    "output": "text",
    "source_lang": "en"
  }'

Example response

{
  "ok": true,
  "input_modality": "text",
  "output_modality": "text",
  "source_lang": "en",
  "source_detection": "langdetect",
  "target_lang": "sw",
  "target_display": "Swahili",
  "translation_used": true,
  "translation_backend": "bedrock",
  "text": "Habari, habari yako leo?",
  "audio_url": null,
  "audio_format": null,
  "audio_duration_seconds": null,
  "latency_ms": 1090.5
}

The response includes translation_used (whether translation was applied), text (the output string), and, when output is "audio", an audio_url link and audio_format.
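
For reference, the same request can be sketched in Python using only the standard library. The helper names (build_text_payload, translate_text) are illustrative, and the client-side checks simply mirror the empty_text and invalid_output errors listed later on this page; treating whitespace-only text as empty is an assumption.

```python
import json
import urllib.request

API_URL = "https://api.fildraai.com/api/v1/text"


def build_text_payload(text, lang, output="text", source_lang="auto"):
    """Validate the fields locally and return the JSON body as a dict."""
    if not text.strip():  # assumption: whitespace-only counts as empty
        raise ValueError("text must not be empty")
    if output not in ("text", "audio"):
        raise ValueError('output must be "text" or "audio"')
    return {"text": text, "lang": lang, "output": output, "source_lang": source_lang}


def translate_text(api_key, text, lang, output="text", source_lang="auto", timeout=30):
    """POST the payload and return the decoded JSON response."""
    payload = build_text_payload(text, lang, output, source_lang)
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-Api-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError for 4xx and 5xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A call such as translate_text("YOUR_USER_API_KEY", "Hello, how are you today?", "sw", source_lang="en") would return a dict shaped like the example response, with the translation under the text key.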

Asynchronous audio jobs

Audio processing (transcription and synthesis) is handled asynchronously via a job queue. This allows large audio files to be processed in the background without blocking your application.

Known limitation: spoken-audio download

When you request output: "audio", the job returns an audio_url, but that link is not directly downloadable yet. Until this is fixed, use output: "text" if you need the result programmatically. Transcription and translation text are unaffected.

Upload audio

POST /api/v1/audio accepts a multipart form upload with the audio file and lang and output form fields. Returns a job_id immediately.

Poll for status

GET /api/v1/audio/audio_job_abc123 returns the current job status. Replace audio_job_abc123 with the job_id returned by the upload request. When status is DONE, the response includes the transcript, the translation, and (if synthesis was requested) an audio URL. Note: the synthesised audio URL is not yet directly downloadable; prefer text output until this is fixed.

Accepted formats

Common audio container formats (MP3, M4A, WAV, OGG, WebM, FLAC). The maximum file size is 15 MB, but the effective limit is your plan's audio duration cap (Free 60s, Basic 300s, Pro 900s); a longer clip is rejected even when the file is small. Live in-browser recordings from the FieldAudio playground are capped at 10 seconds. Shorter clips process faster.
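
The limits above can be checked locally before an upload is attempted. The sketch below is one possible pre-flight check, not part of the API: the function name is hypothetical, the extension set is an assumed mapping from the listed container formats, and "15 MB" is read as 15 MiB.

```python
import os

MAX_FILE_BYTES = 15 * 1024 * 1024  # assumption: "15 MB" interpreted as 15 MiB
PLAN_DURATION_CAP_S = {"free": 60, "basic": 300, "pro": 900}
# Assumed file extensions for the container formats listed above.
ACCEPTED_EXTENSIONS = {".mp3", ".m4a", ".wav", ".ogg", ".webm", ".flac"}


def upload_problems(filename, size_bytes, duration_s, plan):
    """Return a list of reasons the upload would likely be rejected."""
    cap = PLAN_DURATION_CAP_S.get(plan.lower())
    if cap is None:
        raise ValueError(f"unknown plan: {plan!r}")
    problems = []
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ACCEPTED_EXTENSIONS:
        problems.append(f"unrecognised extension {ext!r}")
    if size_bytes <= 0:
        problems.append("file is empty")
    elif size_bytes > MAX_FILE_BYTES:
        problems.append("file exceeds 15 MB")
    # The duration cap applies even when the file itself is small.
    if duration_s > cap:
        problems.append(f"duration {duration_s:g}s exceeds the {plan} cap of {cap}s")
    return problems
```

An empty list means none of these local checks failed; the server remains the final authority on what it accepts.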

Async text jobs

For longer text that needs audio synthesis, use POST /api/v1/text/jobs to submit a text translation job asynchronously. Retrieve the result through your backend with GET /api/v1/text/jobs/job_abc123.

Example, upload audio (cURL)

curl -X POST "https://api.fildraai.com/api/v1/audio" \
  -H "X-Api-Key: YOUR_SERVER_API_KEY" \
  -F "file=@recording.wav" \
  -F "lang=sw" \
  -F "output=text" \
  -F "source_lang=auto"

Python, upload audio (requests)

import requests

url = "https://api.fildraai.com/api/v1/audio"
headers = {"X-Api-Key": "YOUR_SERVER_API_KEY"}

# `files` triggers multipart/form-data automatically.
with open("/absolute/path/to/recording.wav", "rb") as fh:
    response = requests.post(
        url,
        headers=headers,
        data={"lang": "sw", "output": "text", "source_lang": "auto"},
        files={"file": ("recording.wav", fh, "audio/wav")},
        timeout=60,
    )

response.raise_for_status()
job = response.json()
print("job_id:", job["job_id"])

# Then poll /api/v1/audio/{job_id} until status is "DONE".

JavaScript, upload audio (fetch + FormData)

// Works in browsers and Node 18+ (built-in fetch + FormData).
// `audioFile` is a File or Blob, from <input type="file"> or recorded audio.
async function uploadAudio(audioFile) {
  const form = new FormData();
  form.append('file', audioFile, audioFile.name || 'recording.wav');
  form.append('lang', 'sw');
  form.append('output', 'text');
  form.append('source_lang', 'auto');

  const response = await fetch('https://api.fildraai.com/api/v1/audio', {
    method: 'POST',
    headers: { 'X-Api-Key': 'YOUR_SERVER_API_KEY' },
    // IMPORTANT: do NOT set Content-Type; fetch sets the correct
    // multipart boundary automatically when you pass FormData.
    body: form,
  });

  if (!response.ok) throw new Error(`HTTP ${response.status}`);
  const job = await response.json();
  return job.job_id; // poll /api/v1/audio/{job_id} for completion
}

Test in Postman

Postman is the fastest way to verify that your API key and audio file work before you wire up code. The steps below mirror the cURL, Python, and JavaScript examples above.

  1. Create a new request. Method POST, URL https://api.fildraai.com/api/v1/audio.
  2. Add the API key header. Headers tab → add X-Api-Key with your server-audience key value.
  3. Switch the body to multipart. Body tab → select form-data.
  4. Add the audio file + form fields. Create a key file, change its type from Text to File, pick your .wav / .mp3 / .webm. Add three text rows: lang=sw, output=text, source_lang=auto.
  5. Send it. A successful response returns 202 with a job_id. Poll GET /api/v1/audio/{job_id} with the same X-Api-Key header until status: "DONE".

Screenshots to capture (save the PNGs in /static/img/docs/postman/ under the filenames shown):

  • Screenshot 1, POST method + audio endpoint URL: Postman with POST selected and the full URL pasted in. Filename: postman_audio_01_url.png
  • Screenshot 2, X-Api-Key header (server-audience key): Headers tab with one row visible, X-Api-Key plus your masked server-audience key. Filename: postman_audio_02_header.png
  • Screenshot 3, Body → form-data with audio File row: Body tab with form-data selected, the file row toggled to File with a WAV selected, and the three text rows below it. Filename: postman_audio_03_formdata.png
  • Screenshot 4, 202 response with job_id: Response pane showing a 202 status and JSON containing job_id and the initial status: "QUEUED". Filename: postman_audio_04_response.png

Tip: save the API key as a Postman environment variable (e.g. {{FILDRA_SERVER_KEY}}) so you don't paste it into every request and so screenshots you share don't leak it.

Polling job status

After submitting an audio or text job, poll the status endpoint from your backend until the job reaches a terminal state. The current production policy requires a server-audience key for status polling.

Active states

  • QUEUED, Job received and waiting to be processed
  • PROCESSING, Transcription or synthesis is in progress

Terminal states

  • DONE, Processing finished; results are available in the response
  • FAILED, Processing failed; check error_code and error_message for details

Example responses

Submitting a job returns 202 with a job_id. Polling GET /api/v1/audio/{job_id} returns 200 at every stage; read status to know where the job is.

# 1. Submit response (HTTP 202)
{ "ok": true, "status": "QUEUED", "job_id": "53aeb56cfb8f4c1fb5c2d64198f2f9be" }

# 2. Completed job (HTTP 200, status DONE)
{
  "status": "DONE",
  "job_id": "53aeb56cfb8f4c1fb5c2d64198f2f9be",
  "input_modality": "audio",
  "output_modality": "text",
  "source_lang": "en",
  "target_lang": "sw",
  "transcript_text": "Good morning, the maize in the lower field is turning yellow.",
  "text": "Habari za asubuhi, mahindi katika shambani la chini yanabadilika kuwa manjano.",
  "audio_url": null,
  "audio_duration_seconds": null,
  "error_code": null,
  "error_message": null
}

# 3. Failed job (HTTP 200, status FAILED)
{
  "status": "FAILED",
  "job_id": "…",
  "error_code": "adapter_job_failed",
  "error_message": "Audio exceeds the maximum duration for your plan."
}

When you request output: "audio", a DONE job also carries audio_url, audio_format, and audio_duration_seconds. Note: the synthesised audio_url is not directly downloadable yet (see the known limitation above); prefer text output until that is fixed.
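
Because polling returns 200 at every stage, client code has to branch on the status field rather than on the HTTP code. A minimal sketch of that branching, using the field names from the example responses above; JobFailed and job_result are hypothetical names introduced for this illustration.

```python
class JobFailed(Exception):
    """Raised when a polled job reports status FAILED."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def job_result(payload: dict):
    """Return the results of a DONE job, None while active, raise on FAILED."""
    status = payload.get("status")
    if status in ("QUEUED", "PROCESSING"):
        return None  # still active, keep polling
    if status == "FAILED":
        raise JobFailed(payload.get("error_code"), payload.get("error_message"))
    if status == "DONE":
        return {
            "transcript": payload.get("transcript_text"),
            "translation": payload.get("text"),
            "audio_url": payload.get("audio_url"),
        }
    raise ValueError(f"unexpected status: {status!r}")
```

Raising on FAILED keeps the error_code and error_message together with the failure, so callers do not have to inspect the payload again.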

Polling interval

For short audio messages, start polling after 3–5 seconds. Use exponential backoff; most jobs complete within 10–30 seconds. Avoid polling more frequently than once per 2 seconds to stay within rate limits.
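
One way to follow this guidance is sketched below. The first delay of 3 seconds and the 2-second minimum gap come from the text above; the doubling factor, the 30-second ceiling, and the 120-second overall budget are assumptions chosen for illustration. fetch_status is a hypothetical callable that performs the GET request and returns the decoded JSON.

```python
import time


def backoff_delays(first=3.0, factor=2.0, floor=2.0, ceiling=30.0):
    """Yield an endless series of waits that grow by `factor` up to `ceiling`."""
    delay = max(first, floor)
    while True:
        yield delay
        delay = min(max(delay * factor, floor), ceiling)


def poll_until_terminal(fetch_status, max_wait=120.0, sleep=time.sleep):
    """Call fetch_status() with backoff until the job is DONE or FAILED."""
    waited = 0.0
    for delay in backoff_delays():
        if waited + delay > max_wait:
            raise TimeoutError(f"job not finished after {waited:g}s")
        sleep(delay)
        waited += delay
        payload = fetch_status()
        if payload.get("status") in ("DONE", "FAILED"):
            return payload
```

Passing sleep as a parameter keeps the loop testable without real waiting; in production the default time.sleep is used.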

Supported languages

Use GET /api/v1/locales to retrieve the current list of supported languages at runtime. Each locale entry reports whether audio synthesis is available (audio_supported), since not every text locale has voice synthesis.

The built-in baseline lists 14 languages. All 14 translate text, and all 14 also speak back with output: "audio".

A language without audio support still translates text but returns 400 if you request output: "audio".

Field name: lang, not target_lang

The request body uses lang as the target language code. source_lang is optional and defaults to auto-detection. Codes are short (en, sw, zh), not regional forms like sw-ke.

Check at runtime

Language support expands over time. Always query /api/v1/locales at runtime rather than hardcoding the supported language list in your application.
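
A runtime check might look like the sketch below. The response shape of /api/v1/locales is not documented on this page, so the code assumes a list of entries, each with a code field alongside the audio_supported flag mentioned above; both the shape and the helper name supports_output are assumptions to adjust against the real response.

```python
def supports_output(locales, lang, output):
    """Return True if `lang` is listed and can produce the requested output.

    `locales` is assumed to be a list of dicts such as
    {"code": "sw", "audio_supported": True}; only audio_supported is
    confirmed by the documentation, the rest is a guess at the shape.
    """
    if output not in ("text", "audio"):
        raise ValueError('output must be "text" or "audio"')
    for entry in locales:
        if entry.get("code") == lang:
            # Text is available for every listed locale; audio needs the flag.
            return output == "text" or bool(entry.get("audio_supported"))
    return False
```

Running this against a freshly fetched list before each request avoids a 400 for a language that translates text but has no voice.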

Output modes

The output parameter controls what the API returns. Both text and audio endpoints support the same two modes:

output: "text"

Returns the translated text string in the target language. No audio synthesis. Use this when the downstream system will display or speak the text through its own TTS engine.

output: "audio"

Translates the text and synthesises it to speech in the target language. Returns an audio_url to the generated audio file and the audio_format (e.g. mp3).

Error codes

Common error responses from the Text and Audio API:

Client errors (4xx)

  • 400 empty_text, The input text is empty
  • 400 unsupported_target_lang, The target language is not supported
  • 400 audio_unavailable, Audio synthesis not supported for this language
  • 400 invalid_output, output must be "text" or "audio"
  • 401, Missing or invalid API key
  • 404, Job not found (for status polling)

Server errors (5xx)

  • 503, Translation or synthesis service temporarily unavailable
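
The codes above suggest a simple client-side policy: retry only on 503, and treat 4xx responses as problems to fix in the request. The sketch below is one possible mapping, with hypothetical action names; how the error code is carried in the response body is not specified here, so the function takes it as a plain argument.

```python
def classify_error(status, code=None):
    """Map an HTTP status and optional error code to a suggested action."""
    if status == 503:
        return "retry"  # temporary outage, try again with backoff
    if status == 401:
        return "fix_credentials"  # missing or invalid API key
    if status == 404:
        return "unknown_job"  # job_id not found when polling
    if status == 400:
        if code == "audio_unavailable":
            # The language still translates text, so text output may work.
            return "fall_back_to_text"
        return "fix_request"  # empty_text, unsupported_target_lang, invalid_output
    return "unexpected"
```

Keeping the mapping in one function makes it easy to extend if further codes are documented.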

Audio file validation

Audio uploads require a non-empty file. If the file field is missing from the multipart form or the uploaded file is empty, the API returns 400 immediately, before the job is queued. Validate file size and presence before uploading to avoid wasted calls.