AI Interaction Protocol


2026-03-02

The 联犀 platform supports device-side AI interaction over MQTT, covering text-only chat, real-time voice chat, and multimodal input (text plus images, video, or files).

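2026-04-08 production verification notes
Facts verified against the current 120 test environment and the repository code:
Main entry points for device-side AI interaction:
- HTTP: http://120.25.49.238:7777
- MQTT: tcp://120.25.49.238:1883
Default session audio parameters are decided by the server:
- sampleRate=24000
- channels=1
- frameDuration=60
The server now supports chat-level default voice policies, controlled by ai.yaml:
- ListenMode
- InterruptMode
- AutoEnd
- SilenceDurationMs
- MaxIdleDurationMs
If the caller does not pass these parameters explicitly in audioStart / listen start, the server falls back to its own defaults.
InterruptMode currently supports:
- vad
- asr_first_text
- asr_final_text
Current audioStop semantics:
- Closes ASR input for the current round
- Waits for the ASR final result of the current round
- Continues with the downstream LLM / TTS steps
- Does not destroy the whole session
Production now supports a spoken "exit intent":
- When the user clearly says "再见" (goodbye), "结束对话" (end the conversation), or similar
- The server first plays a farewell message
- It then proactively sends sessionClosed
- To continue the conversation, the device must send sessionCreate again
Because audioStop does not end the session, a device can run several rounds after a single sessionCreate:
- audioStart -> UDP audio -> audioStop
- audioStart -> UDP audio -> audioStop
- inputSend
We recommend reading this document alongside the following code:
- backend/things/service/dmsvr/internal/event/deviceMsgEvent/ai.go
- backend/core/service/aisvr/internal/logic/control/listenLogic.go
- backend/core/service/aisvr/internal/domain/chat/chatSession.go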

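1. Protocol Overview

AI interaction uses a dedicated MQTT topic type, ai, kept separate from the thing-model property/event/action types so that the semantics stay clear.

MQTT Topic

Uplink (device → cloud): $thing/up/ai/{ProductID}/{DeviceName}
Downlink (cloud → device): $thing/down/ai/{ProductID}/{DeviceName}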
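
As a small illustration, the two topic strings can be derived from the product ID and device name. The helper name aiTopics is hypothetical; only the topic layout comes from the lines above.

```go
package main

import "fmt"

// aiTopics returns the uplink and downlink AI topics for one device.
// The function name is illustrative; the layout follows the protocol text.
func aiTopics(productID, deviceName string) (up, down string) {
	up = fmt.Sprintf("$thing/up/ai/%s/%s", productID, deviceName)
	down = fmt.Sprintf("$thing/down/ai/%s/%s", productID, deviceName)
	return up, down
}

func main() {
	up, down := aiTopics("my_product_001", "device_001")
	fmt.Println(up)
	fmt.Println(down)
}
```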

Common Message Envelope

All AI protocol messages share one JSON envelope format:

{
  "msgToken": "msg-001",
  "method": "sessionCreate",
  "params": {},
  "data": {},
  "code": 200
}

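Field	Type	Description
msgToken	string	Message token, used to match a response to its request
method	string	Method name (lowerCamelCase)
params	object	Request parameters (used in uplink messages)
data	object	Response data (used in downlink messages)
code	int	Response code; 200 means success (used in downlink messages)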
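
One way to use msgToken for request-response matching is a small registry of pending requests. The sketch below is an assumption about how a device might organize this, not the platform's own client code; pendingRequests and its methods are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// Envelope mirrors the common message envelope described above.
type Envelope struct {
	MsgToken string          `json:"msgToken,omitempty"`
	Method   string          `json:"method"`
	Params   json.RawMessage `json:"params,omitempty"`
	Data     json.RawMessage `json:"data,omitempty"`
	Code     int             `json:"code,omitempty"`
}

// pendingRequests (hypothetical helper) tracks uplink requests awaiting a reply.
type pendingRequests struct {
	mu      sync.Mutex
	waiters map[string]chan Envelope
}

func newPendingRequests() *pendingRequests {
	return &pendingRequests{waiters: make(map[string]chan Envelope)}
}

// Register reserves a buffered channel for the reply to msgToken.
func (p *pendingRequests) Register(msgToken string) <-chan Envelope {
	ch := make(chan Envelope, 1)
	p.mu.Lock()
	p.waiters[msgToken] = ch
	p.mu.Unlock()
	return ch
}

// Resolve parses a downlink payload and hands it to the matching waiter.
// It reports false for invalid JSON and for messages nobody is waiting on,
// such as server-initiated notifications without a known msgToken.
func (p *pendingRequests) Resolve(payload []byte) bool {
	var env Envelope
	if err := json.Unmarshal(payload, &env); err != nil {
		return false
	}
	p.mu.Lock()
	ch, ok := p.waiters[env.MsgToken]
	delete(p.waiters, env.MsgToken)
	p.mu.Unlock()
	if !ok {
		return false
	}
	ch <- env
	return true
}

func main() {
	p := newPendingRequests()
	reply := p.Register("msg-001")
	p.Resolve([]byte(`{"msgToken":"msg-001","method":"sessionCreated","code":200}`))
	env := <-reply
	fmt.Println(env.Method, env.Code)
}
```

Messages that Resolve rejects can still be routed by method, which suits streamed notifications such as respTextDelta.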

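2. Agent-Device Binding

When a device performs AI interaction, it is automatically associated with its agent (Agent) and digital clone (Clone).

2.1 AgentGroup Purposes

An AgentGroup uses its purpose field to declare what it is for. Each purpose maps to a different AI interaction entry point:

Purpose	Use case	Trigger path
`user`	User chat from web/app	HTTP `/api/v1/ai/user/completions`
`device`	Reserved by the platform (not enabled yet)	-
`default`	General default	-
`platform`	Platform-level configuration	-

Device MQTT AI interaction is not routed through AgentGroup. When a device sends sessionCreate, the cloud directly uses the Clone bound to the device (see Section 2.3). The Clone already carries the corresponding Agent information, so no AgentGroup lookup happens at request time.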

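2.2 Binding a Default Agent to a Product

Setting defaultAgentID on a product (Product) makes every device under that product get its own Clone automatically the first time it is bound:

Configuration steps:

Create an AgentGroup in the console with purpose=device
Create an Agent under that group and configure its system prompt and LLM parameters
Set defaultAgentID when creating the product, or update the binding on the product detail page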

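2.3 Automatic Clone Creation for Devices

The first time a user binds a device (by calling the deviceInfo/bind API), the system runs the following logic:

Check whether the device already has a Clone (dm_device_info.clone_id)
If it has none and the product has a defaultAgentID, call the Clone RPC of aisvr to create one
The Clone code has the format device_{productID}_{deviceName} (globally unique)
Bind the Clone to the device and record its cloneID

Product.defaultAgentID → first device bind → Clone created automatically → dm_device_info.clone_id

What the Clone provides:

Each device has its own Clone, so AI memory is fully isolated
A Clone inherits the Agent's prompt and configuration but keeps independent memory
AI conversation history and preference settings never leak between devices

If the product has no defaultAgentID: binding a device does not create a Clone (clone_id stays empty). A sessionCreate from that device then returns an error (invalidInput, stating that the device is not bound to an Agent). To enable AI interaction, configure defaultAgentID on the product and bind the device again, or attach a Clone to the device manually through the management API.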
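
The bind-time decision and the Clone code format can be sketched as follows. This is an illustration of the rules stated above, not the dmsvr implementation; both function names are hypothetical, and treating 0 as "not set" is an assumption of the sketch.

```go
package main

import "fmt"

// cloneCode builds the Clone code in the device_{productID}_{deviceName} format.
func cloneCode(productID, deviceName string) string {
	return fmt.Sprintf("device_%s_%s", productID, deviceName)
}

// shouldCreateClone mirrors the bind-time check: a Clone is created only when
// the device has none yet and the product has a default Agent.
// Assumption: an ID of 0 means "not set".
func shouldCreateClone(existingCloneID, defaultAgentID int64) bool {
	return existingCloneID == 0 && defaultAgentID != 0
}

func main() {
	fmt.Println(cloneCode("my_product_001", "device_001"))
	fmt.Println(shouldCreateClone(0, 42)) // no Clone yet, product has an Agent
	fmt.Println(shouldCreateClone(0, 0))  // product has no defaultAgentID
}
```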

2.4 Complete Binding Flow

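3. Uplink Methods (Device → Cloud)

Method	Description
`sessionCreate`	Create an AI session
`sessionClose`	Close the AI session
`audioStart`	Start the audio stream
`audioStop`	Stop the audio stream
`respCancel`	Cancel the current response (interrupt)
`inputSend`	Send multimodal input

Notes on interruption semantics:
The server does not interrupt the moment it detects any sound. At runtime it distinguishes three modes:
- vad: stop TTS once VAD reaches the interruption threshold
- asr_first_text: stop TTS / LLM once the first valid ASR text arrives
- asr_final_text: stop TTS / LLM once the final ASR text arrives
The server also applies a short playback protection window to reduce false interruptions caused by the microphone picking up the device's own TTS output.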
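
To make the three modes concrete, here is a minimal decision function. It is a simplified illustration, not the server's logic: the event labels are hypothetical, the length of the protection window is not documented here and is therefore a parameter, and applying the window to every mode is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// speechEvent values are hypothetical labels used only in this sketch.
type speechEvent string

const (
	evVADTriggered speechEvent = "vad_triggered"
	evASRFirstText speechEvent = "asr_first_text"
	evASRFinalText speechEvent = "asr_final_text"
)

// shouldInterrupt reports whether ev should stop playback under the given
// InterruptMode. sinceTTSStart is how long TTS has been playing; guard is the
// playback protection window (supplied by the caller, value is an assumption).
func shouldInterrupt(mode string, ev speechEvent, sinceTTSStart, guard time.Duration) bool {
	if sinceTTSStart < guard {
		return false // still inside the protection window
	}
	switch mode {
	case "vad":
		return ev == evVADTriggered
	case "asr_first_text":
		return ev == evASRFirstText
	case "asr_final_text":
		return ev == evASRFinalText
	default:
		return false
	}
}

func main() {
	guard := 300 * time.Millisecond // illustrative value only
	fmt.Println(shouldInterrupt("asr_first_text", evVADTriggered, time.Second, guard))
	fmt.Println(shouldInterrupt("asr_first_text", evASRFirstText, time.Second, guard))
	fmt.Println(shouldInterrupt("vad", evVADTriggered, 100*time.Millisecond, guard))
}
```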

3.1 sessionCreate - Create a Session

Request example (voice chat):

{
  "msgToken": "msg-001",
  "method": "sessionCreate",
  "params": {
    "modalities": ["text", "audio"],
    "audioParams": {
      "format": "opus",
      "sampleRate": 24000,
      "channels": 1,
      "frameDuration": 60
    }
  }
}

Request example (text-only chat):

{
  "msgToken": "msg-001",
  "method": "sessionCreate",
  "params": {
    "modalities": ["text"]
  }
}

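Parameters:

Field	Type	Required	Description
modalities	[]string	Yes	Requested modalities: `text` (text only), `audio` (voice)
audioParams	object	No	Audio parameters (needed when the audio modality is included)
audioParams.format	string	No	Audio format: `opus` (recommended), `pcm`, `wav`, `mp3`
audioParams.sampleRate	int	No	Sample rate in Hz: 8000 / 16000 / 24000 / 48000; the current production default is 24000
audioParams.channels	int	No	Channel count: 1 (mono, recommended), 2 (stereo)
audioParams.frameDuration	int	No	Frame duration in ms: 20 / 40 / 60
transport	string	No	Audio transport: `udp` (recommended, default), `mqtt`
instructions	string	No	Temporary system prompt for this session (overrides the Agent's default systemPrompt for this session only)

For text-only chat, set modalities to ["text"]; audioParams is not needed.
A session is valid for 3600 seconds (1 hour) after creation. When it times out, the server closes it automatically and sends sessionClosed.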
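
A device may want to check audioParams locally before sending sessionCreate. The sketch below validates against the values listed in the table; the AudioParams type and validate method are hypothetical, and treating zero values as "omitted" is an assumption.

```go
package main

import "fmt"

// AudioParams mirrors the audioParams object of sessionCreate.
type AudioParams struct {
	Format        string `json:"format,omitempty"`
	SampleRate    int    `json:"sampleRate,omitempty"`
	Channels      int    `json:"channels,omitempty"`
	FrameDuration int    `json:"frameDuration,omitempty"`
}

func inStrings(list []string, v string) bool {
	for _, x := range list {
		if x == v {
			return true
		}
	}
	return false
}

func inInts(list []int, v int) bool {
	for _, x := range list {
		if x == v {
			return true
		}
	}
	return false
}

// validate checks every field that is set against the documented values.
// Zero values are treated as "omitted" so the server default applies.
func (p AudioParams) validate() error {
	if p.Format != "" && !inStrings([]string{"opus", "pcm", "wav", "mp3"}, p.Format) {
		return fmt.Errorf("unsupported format %q", p.Format)
	}
	if p.SampleRate != 0 && !inInts([]int{8000, 16000, 24000, 48000}, p.SampleRate) {
		return fmt.Errorf("unsupported sampleRate %d", p.SampleRate)
	}
	if p.Channels != 0 && !inInts([]int{1, 2}, p.Channels) {
		return fmt.Errorf("unsupported channels %d", p.Channels)
	}
	if p.FrameDuration != 0 && !inInts([]int{20, 40, 60}, p.FrameDuration) {
		return fmt.Errorf("unsupported frameDuration %d", p.FrameDuration)
	}
	return nil
}

func main() {
	ok := AudioParams{Format: "opus", SampleRate: 24000, Channels: 1, FrameDuration: 60}
	fmt.Println(ok.validate())
	bad := AudioParams{SampleRate: 44100}
	fmt.Println(bad.validate())
}
```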

3.2 inputSend - Send Multimodal Input

Request example (text only):

{
  "msgToken": "msg-002",
  "method": "inputSend",
  "params": {
    "contents": [
      {"type": "text", "text": "你好，请介绍一下自己"}
    ]
  }
}

Request example (text + image):

{
  "msgToken": "msg-003",
  "method": "inputSend",
  "params": {
    "contents": [
      {"type": "text", "text": "这张图片里有什么？"},
      {"type": "image_url", "imageUrl": "https://oss.example.com/photo.jpg"}
    ]
  }
}

Request example (text + image + audio file):

{
  "msgToken": "msg-004",
  "method": "inputSend",
  "params": {
    "contents": [
      {"type": "text", "text": "请结合图片和语音说明现场情况"},
      {"type": "image_url", "imageUrl": "https://oss.example.com/camera.jpg"},
      {"type": "audio", "audioData": "BASE64_ENCODED_AUDIO", "audioFmt": "wav"}
    ],
    "modalities": ["audio"]
  }
}

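Parameters:

Field	Type	Required	Description
contents	[]ContentPart	Yes	List of multimodal content parts (see Section 5)
modalities	[]string	No	Desired output modalities (the session default applies when omitted)

Additional notes:

contents types verified so far:
- text
- image_url
- audio
The audio type carries the audio file content through audioData + audioFmt.
When modalities=["audio"] and the session already has a UDP audio channel bound, the server also triggers downlink TTS on the multimodal text path, not only on the pure speech-recognition path.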

3.3 Other Uplink Methods

audioStart, audioStop, respCancel, and sessionClose can all be sent without params:

{"msgToken": "msg-004", "method": "audioStart"}
{"msgToken": "msg-005", "method": "audioStop"}
{"msgToken": "msg-006", "method": "respCancel"}
{"msgToken": "msg-007", "method": "sessionClose"}

Optional extended parameters for audioStart (supported by the current server):

{
  "msgToken": "msg-004",
  "method": "audioStart",
  "params": {
    "mode": "auto",
    "realtime_mode": "asr_first_text",
    "auto_end": false,
    "silence_duration_ms": 500,
    "max_idle_duration_ms": 0
  }
}

Any of these fields that is omitted falls back to the server default.
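
Because values such as auto_end=false and max_idle_duration_ms=0 are meaningful, a client needs to tell "omitted" apart from "explicit zero". One possible approach, sketched here under that assumption, is to use pointer fields with omitempty; the AudioStartParams type and encodeAudioStartParams helper are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AudioStartParams lists the optional audioStart fields shown above.
// Pointer fields distinguish "not set" (nil, omitted from JSON) from explicit
// zero values, so omitted fields fall back to the server defaults.
type AudioStartParams struct {
	Mode              *string `json:"mode,omitempty"`
	RealtimeMode      *string `json:"realtime_mode,omitempty"`
	AutoEnd           *bool   `json:"auto_end,omitempty"`
	SilenceDurationMs *int    `json:"silence_duration_ms,omitempty"`
	MaxIdleDurationMs *int    `json:"max_idle_duration_ms,omitempty"`
}

// encodeAudioStartParams serializes only the fields that were set.
func encodeAudioStartParams(p AudioStartParams) (string, error) {
	b, err := json.Marshal(p)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	autoEnd := false
	silence := 500
	out, err := encodeAudioStartParams(AudioStartParams{
		AutoEnd:           &autoEnd,
		SilenceDurationMs: &silence,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // only the two explicitly set fields are present
}
```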

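Method	When to send
`audioStart`	The device starts capturing microphone data and tells the cloud to begin receiving UDP audio frames
`audioStop`	The device stops capturing and tells the cloud to end the current round of voice input; the server waits up to 1.5 s for the final ASR result, then triggers LLM processing
`respCancel`	The user interrupts the AI reply (for example by pressing the button again); this aborts TTS
`sessionClose`	The conversation is over; releases session resources

Differences between audioStop and automatic VAD triggering:

	Manual `audioStop`	Automatic VAD trigger
Trigger	Sent by the device	Triggered automatically once the server detects a speech/silence boundary
Does the device send a message?	Yes, it sends `audioStop`	No, nothing is required from the device
When UDP sending stops	The device may stop after sending `audioStop`	The device may keep sending (silent frames); VAD ignores silence
Wait time	The server waits up to 1.5 s for the final ASR result	The server processes immediately on the silence boundary, with no wait

Session state after audioStop: in the current production implementation, audioStop only ends the current input round and does not close the session. After one round of audioStop -> respSttDone -> respTextDone/respAudioDone, the device can continue with another audioStart or inputSend.
Session state after respCancel: once the device receives audioSpeechInterrupted, the session is still active and no new sessionCreate is needed. The device can immediately send audioStart (to begin a new round of voice capture) or inputSend (to send text) and go straight into the next round.
Session state after a spoken exit: when the user clearly says "再见" (goodbye), "结束对话" (end the conversation), "退出对话" (exit the conversation), or similar, the server treats it as an exit intent. It returns one normal round containing a farewell message and then proactively sends sessionClosed. The session has ended at that point, and the device must send sessionCreate again to continue.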

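3.4 Production Debugging Tips

When the device-side voice pipeline misbehaves, troubleshoot in this order:

Run the ASR capability on its own first
- Verify respSttDone only
- Rule out interference from TTS / MCP / session lifecycle
Then run the TTS capability on its own
- Verify respAudioStart / respAudioDone only
Finally run the complete workflow
- Multi-round voice
- Device control
- Preference memory

For the stable test baseline in the current repository, see:

backend/things/tools/devicesim/integration_test.go
backend/things/test/testdata/README.md
shell/devicesim-oneclick-test.sh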

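4. Downlink Methods (Cloud → Device)

Method	Description
`sessionCreated`	Session created; returns connection credentials
`sessionClosed`	Session closed (in reply to sessionClose, on session timeout, or after a spoken exit intent)
`respCreated`	A new AI response round begins
`respSttDelta`	Incremental STT result (streaming)
`respSttDone`	Final STT result
`respTextDelta`	Incremental LLM text output (streaming)
`respTextDone`	Complete LLM text output
`audioSpeechStarted`	TTS playback started
`audioSpeechStopped`	TTS playback finished
`audioSpeechInterrupted`	TTS playback interrupted (triggered by respCancel)
`respAudioStart`	TTS audio stream starts (sent before the UDP push begins)
`respAudioDone`	TTS audio stream ends
`respMedia`	Media content (images, video, and so on)
`error`	Error notification

4.1 sessionCreated / sessionClosed

sessionCreated is sent after a successful sessionCreate and carries the session parameters plus UDP connection information where applicable.

sessionClosed is sent in the following three cases:

The device sends sessionClose, and the server replies after releasing the session
The session has been idle for more than 3600 seconds, and the server closes it on timeout
The user clearly says "结束对话" (end the conversation) or "再见" (goodbye) by voice, and the server closes the session after playing a farewell message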

{"method": "sessionClosed", "code": 200}

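Once the device receives sessionClosed, the session is fully released. To continue the conversation it must send sessionCreate again.

sessionCreated - Session Creation Response

Response example (voice mode):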

{
  "method": "sessionCreated",
  "msgToken": "msg-001",
  "code": 200,
  "data": {
    "sessionId": "sess-abc123",
    "transport": "udp",
    "modalities": ["text", "audio"],
    "audioParams": {
      "format": "opus",
      "sampleRate": 24000,
      "channels": 1,
      "frameDuration": 60
    },
    "udp": {
      "server": "192.168.1.100",
      "port": 6789,
      "key": "base64-encoded-aes-key==",
      "nonce": "base64-encoded-nonce=="
    },
    "uploadUrl": "/api/v1/things/device/edge/upload-file",
    "supportedModalities": ["text", "audio", "image"]
  }
}

Response example (text-only mode):

{
  "method": "sessionCreated",
  "msgToken": "msg-001",
  "code": 200,
  "data": {
    "sessionId": "sess-xyz789",
    "modalities": ["text"],
    "uploadUrl": "/api/v1/things/device/edge/upload-file"
  }
}

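data fields:

Field	Type	Description
sessionId	string	Session ID, the identifier for all subsequent related messages
transport	string	Transport actually in use (`udp` or `mqtt`)
modalities	[]string	Modalities supported by this session
audioParams	object	Audio parameters the server actually uses (may differ from the request)
audioParams.format	string	Audio format
audioParams.sampleRate	int	Sample rate (Hz)
audioParams.channels	int	Channel count
audioParams.frameDuration	int	Frame duration (ms)
udp	object	UDP connection information (returned in voice mode only)
udp.server	string	UDP server address (IP or domain name)
udp.port	int	UDP port
udp.key	string	AES-CTR encryption key (Base64-encoded, 16 bytes)
udp.nonce	string	Encryption nonce (Base64-encoded, 16 bytes)
uploadUrl	string	File upload endpoint, used to pre-upload files for multimodal input
supportedModalities	[]string	All modalities the server supports (for reference)

udp.key and udp.nonce are used to encrypt audio packets with AES-CTR. See UDP Audio Channel for details.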
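
The sketch below only shows how the two Base64 fields can be decoded and turned into an AES-CTR stream with the Go standard library. It makes a loud assumption: the 16-byte nonce is used directly as the initial counter block. The real packet layout and any per-packet nonce handling are defined in the UDP audio channel specification, not here, and newAudioStream is a hypothetical helper.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"encoding/base64"
	"fmt"
)

// newAudioStream decodes udp.key and udp.nonce and returns an AES-CTR stream.
// Assumption: the nonce is used as-is for the initial counter block.
func newAudioStream(keyB64, nonceB64 string) (cipher.Stream, error) {
	key, err := base64.StdEncoding.DecodeString(keyB64)
	if err != nil {
		return nil, fmt.Errorf("decode key: %w", err)
	}
	nonce, err := base64.StdEncoding.DecodeString(nonceB64)
	if err != nil {
		return nil, fmt.Errorf("decode nonce: %w", err)
	}
	if len(key) != 16 || len(nonce) != aes.BlockSize {
		return nil, fmt.Errorf("want 16-byte key and nonce, got %d and %d", len(key), len(nonce))
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewCTR(block, nonce), nil
}

func main() {
	// Made-up credentials: Base64 of the ASCII string "0123456789abcdef".
	key := "MDEyMzQ1Njc4OWFiY2RlZg=="
	nonce := "MDEyMzQ1Njc4OWFiY2RlZg=="

	enc, err := newAudioStream(key, nonce)
	if err != nil {
		panic(err)
	}
	frame := []byte("opus-frame")
	ciphertext := make([]byte, len(frame))
	enc.XORKeyStream(ciphertext, frame)

	// CTR decryption is the same XOR with a fresh stream.
	dec, _ := newAudioStream(key, nonce)
	plain := make([]byte, len(ciphertext))
	dec.XORKeyStream(plain, ciphertext)
	fmt.Println(string(plain) == string(frame))
}
```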

4.2 respCreated - Response Started

Sent each time the AI starts generating a new response round:

{
  "method": "respCreated",
  "data": {
    "respId": "resp_001"
  }
}

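Field	Type	Description
respId	string	Response ID, used to correlate the delta/done messages that follow

4.3 respTextDelta / respTextDone - Streaming Text Output

The LLM reply is streamed token by token through respTextDelta, and the complete text is sent at the end through respTextDone: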

{"method": "respTextDelta", "data": {"respId": "resp_001", "contentIndex": 0, "delta": "你好"}}
{"method": "respTextDelta", "data": {"respId": "resp_001", "contentIndex": 0, "delta": "，我是"}}
{"method": "respTextDelta", "data": {"respId": "resp_001", "contentIndex": 0, "delta": "AI助手"}}
{"method": "respTextDone",  "data": {"respId": "resp_001", "contentIndex": 0, "text": "你好，我是AI助手"}}

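respTextDelta fields:

Field	Type	Description
respId	string	Response ID (matches respCreated)
contentIndex	int	Content index (distinguishes content blocks in a multimodal response)
delta	string	Text fragment added by this message

respTextDone fields:

Field	Type	Description
respId	string	Response ID
contentIndex	int	Content index
text	string	Complete text (the concatenation of all deltas)
contents	[]ContentPart	Multimodal content (optional, used when the LLM also outputs images or similar)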
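
A device that renders text as it streams can append each delta per (respId, contentIndex) and use respTextDone as the authoritative result. The textAssembler type below is a hypothetical sketch of that bookkeeping.

```go
package main

import (
	"fmt"
	"strings"
)

// textKey identifies one content block of one response.
type textKey struct {
	respID       string
	contentIndex int
}

// textAssembler (hypothetical) concatenates respTextDelta fragments.
type textAssembler struct {
	parts map[textKey]*strings.Builder
}

func newTextAssembler() *textAssembler {
	return &textAssembler{parts: make(map[textKey]*strings.Builder)}
}

// OnDelta appends one fragment and returns the text accumulated so far.
func (a *textAssembler) OnDelta(respID string, contentIndex int, delta string) string {
	k := textKey{respID, contentIndex}
	b, ok := a.parts[k]
	if !ok {
		b = &strings.Builder{}
		a.parts[k] = b
	}
	b.WriteString(delta)
	return b.String()
}

// OnDone returns the final text from respTextDone and reports whether the
// streamed deltas matched it. The buffer for that block is released.
func (a *textAssembler) OnDone(respID string, contentIndex int, text string) (string, bool) {
	k := textKey{respID, contentIndex}
	streamed := ""
	if b, ok := a.parts[k]; ok {
		streamed = b.String()
		delete(a.parts, k)
	}
	return text, streamed == text
}

func main() {
	a := newTextAssembler()
	a.OnDelta("resp_001", 0, "你好")
	a.OnDelta("resp_001", 0, "，我是")
	a.OnDelta("resp_001", 0, "AI助手")
	final, matched := a.OnDone("resp_001", 0, "你好，我是AI助手")
	fmt.Println(final, matched)
}
```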

4.4 respSttDelta / respSttDone - Speech Recognition Results

{"method": "respSttDelta", "data": {"respId": "resp_001", "delta": "今天天"}}
{"method": "respSttDelta", "data": {"respId": "resp_001", "delta": "今天天气"}}
{"method": "respSttDone",  "data": {"respId": "resp_001", "text": "今天天气怎么样"}}

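Field	Type	Description
respId	string	Response ID
delta	string	Interim recognized text (updated in real time and possibly unstable; in the example above each delta carries the full interim text so far)
text	string	Final recognized text (stable result, present only in the Done message)

4.5 audioSpeechStarted / Stopped / Interrupted - TTS State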

{"method": "audioSpeechStarted"}
{"method": "audioSpeechStopped"}
{"method": "audioSpeechInterrupted"}

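None of these three messages carries a data field:

audioSpeechStarted: TTS audio starts being pushed over UDP; the device should start playback
audioSpeechStopped: TTS has finished playing; the device may enter the next listening round
audioSpeechInterrupted: the user interrupted (by sending respCancel); the device should stop playback immediately

Differences from respAudioStart/respAudioDone in Section 4.6:

	`audioSpeech*` (4.5)	`respAudio*` (4.6)
Semantic level	Player state notification	Response-level audio boundary
Carries respId	No	Yes
Purpose	Drive the device player (start, stop, interrupt immediately)	Track the audio stream range of one AI response, matching `respCreated`

Message order for a complete voice reply:

audioSpeechStarted       ← tells the device to start playback
respAudioStart (respId)  ← UDP audio frames are about to be pushed
[UDP audio frames pushed continuously]
respAudioDone (respId)   ← the last UDP frame has been pushed
audioSpeechStopped       ← tells the device it may enter the next listening round

Devices typically drive the player state machine with audioSpeechStarted/audioSpeechStopped/audioSpeechInterrupted, and use respAudioStart/respAudioDone to track the audio boundary of a given response in a multi-round conversation (for logging, multimodal synchronization, and so on).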
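
A minimal player state machine driven by the audioSpeech* messages might look like this. The state names, the action strings, and the step function are all hypothetical; how a real player starts, drains, or flushes audio is device-specific.

```go
package main

import "fmt"

type playerState int

const (
	stateIdle playerState = iota
	statePlaying
)

// step applies one downlink method to the player state and returns the new
// state plus an action for the audio layer ("" means nothing to do).
func step(s playerState, method string) (playerState, string) {
	switch method {
	case "audioSpeechStarted":
		return statePlaying, "startPlayback"
	case "audioSpeechStopped":
		return stateIdle, "finishPlayback"
	case "audioSpeechInterrupted":
		return stateIdle, "flushPlayback" // drop buffered audio immediately
	default:
		// respAudioStart / respAudioDone only mark stream boundaries.
		return s, ""
	}
}

func main() {
	s := stateIdle
	for _, m := range []string{
		"audioSpeechStarted", "respAudioStart", "respAudioDone", "audioSpeechStopped",
	} {
		var action string
		s, action = step(s, m)
		fmt.Printf("%-22s state=%d action=%q\n", m, s, action)
	}
}
```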

4.6 respAudioStart / respAudioDone - TTS Audio Stream

{"method": "respAudioStart", "data": {"respId": "resp_001"}}
{"method": "respAudioDone",  "data": {"respId": "resp_001"}}

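Field	Type	Description
respId	string	Response ID

respAudioStart is sent before UDP audio frames start being pushed (right after audioSpeechStarted). respAudioDone is sent after the last frame and before audioSpeechStopped. For the differences from the audioSpeech* messages and their ordering, see Section 4.5 above.

4.7 respMedia - Media Content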

{
  "method": "respMedia",
  "data": {
    "respId": "resp_001",
    "mediaType": "image",
    "url": "https://oss.example.com/ai-gen-image.png",
    "mimeType": "image/png",
    "desc": "根据您的描述生成的图片"
  }
}

Field	Type	Description
respId	string	Response ID
mediaType	string	Media type: `image`, `video`, `file`
url	string	Media resource URL
mimeType	string	MIME type (optional)
desc	string	Media description (optional)

4.8 error - Error Notification

{"method": "error", "data": {"code": "invalidInput", "message": "modalities 字段缺失"}}
{"method": "error", "data": {"code": "sessionExpired", "message": "会话已过期，请重新创建", "respId": "resp_001"}}
{"method": "error", "data": {"code": "modelTimeout", "message": "LLM 响应超时"}}

Field	Type	Description
code	string	Error code (see Section 9)
message	string	Human-readable error description
respId	string	Related response ID (optional, present when the error belongs to a specific response)

5. ContentPart Multimodal Content Types

Each element of the contents array in inputSend is a ContentPart:

type	Description	Key fields	Example
`text`	Plain text	text	`{"type":"text","text":"你好"}`
`image_url`	Image URL	imageUrl	`{"type":"image_url","imageUrl":"https://..."}`
`audio`	Audio data	audioData, audioFmt	`{"type":"audio","audioData":"base64...","audioFmt":"wav"}`
`video_url`	Video URL	videoUrl	`{"type":"video_url","videoUrl":"https://..."}`
`file_url`	File URL	fileUrl, fileMime	`{"type":"file_url","fileUrl":"https://...","fileMime":"application/pdf"}`

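Full field reference:

Field	Type	Description
type	string	Content type (required)
text	string	Text content (used when type=text)
imageUrl	string	Image URL (used when type=image_url)
audioData	string	Audio data (Base64-encoded, used when type=audio)
audioFmt	string	Audio format (used when type=audio, for example wav or mp3)
videoUrl	string	Video URL (used when type=video_url)
fileUrl	string	File URL (used when type=file_url)
fileMime	string	File MIME type (used when type=file_url)
fileName	string	File name (optional, for display)

File Pre-upload

A device can upload files (images, video, documents) through the endpoint below and then reference the returned URL in inputSend.

Full endpoint specification: Device File Upload

URL: /api/v1/things/device/edge/upload-file
Method: POST
Content-Type: multipart/form-data
Authentication: Basic Auth, using the username and password generated in the MQTT format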

curl --location --request POST 'http://{host}/api/v1/things/device/edge/upload-file' \
  --header 'Authorization: Basic {base64(mqtt_username:mqtt_password)}' \
  --form 'file=@"/path/to/photo.jpg"'

Response example:

{
  "code": 200,
  "msg": "success",
  "data": {
    "filePath": "edge/{productID}/{deviceName}/250225/125609/photo.jpg",
    "fileUri": "https://xxx.oss.com/edge/{productID}/{deviceName}/250225/125609/photo.jpg"
  }
}

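After a successful upload, use data.fileUri as the imageUrl or fileUrl in the contents of inputSend.

Dangerous file types such as html, php, and jsp are rejected. For the MQTT username/password format, see MQTT Authentication.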
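
For devices written in Go, the curl call above could be reproduced with the standard library roughly as follows. uploadFile and runAgainstStub are hypothetical helpers; the stub server exists only so that the sketch runs without a real backend, and its reply imitates the documented response shape.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
	"net/http/httptest"
	"strings"
)

// uploadResponse mirrors the documented response body.
type uploadResponse struct {
	Code int    `json:"code"`
	Msg  string `json:"msg"`
	Data struct {
		FilePath string `json:"filePath"`
		FileURI  string `json:"fileUri"`
	} `json:"data"`
}

// uploadFile posts content as multipart field "file" with Basic Auth and
// returns data.fileUri on success.
func uploadFile(client *http.Client, baseURL, user, pass, fileName string, content io.Reader) (string, error) {
	var body bytes.Buffer
	w := multipart.NewWriter(&body)
	part, err := w.CreateFormFile("file", fileName)
	if err != nil {
		return "", err
	}
	if _, err := io.Copy(part, content); err != nil {
		return "", err
	}
	if err := w.Close(); err != nil {
		return "", err
	}

	req, err := http.NewRequest(http.MethodPost, baseURL+"/api/v1/things/device/edge/upload-file", &body)
	if err != nil {
		return "", err
	}
	req.Header.Set("Content-Type", w.FormDataContentType())
	req.SetBasicAuth(user, pass) // MQTT-format username and password

	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var out uploadResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	if out.Code != 200 {
		return "", fmt.Errorf("upload failed: code=%d msg=%s", out.Code, out.Msg)
	}
	return out.Data.FileURI, nil
}

// runAgainstStub exercises uploadFile against a local stand-in server.
func runAgainstStub() (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		_, _, authOK := r.BasicAuth()
		_, hdr, err := r.FormFile("file")
		if !authOK || err != nil {
			http.Error(w, `{"code":400,"msg":"bad request"}`, http.StatusBadRequest)
			return
		}
		fmt.Fprintf(w, `{"code":200,"msg":"success","data":{"fileUri":"https://files.example.com/%s"}}`, hdr.Filename)
	}))
	defer srv.Close()
	return uploadFile(srv.Client(), srv.URL, "mqtt-user", "mqtt-pass", "photo.jpg", strings.NewReader("fake image bytes"))
}

func main() {
	uri, err := runAgainstStub()
	fmt.Println(uri, err)
}
```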

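6. UDP Audio Channel

When the transport of sessionCreate is udp (the default), voice data travels over UDP instead of MQTT to reduce latency.

How the UDP channel is established:

The device sends sessionCreate (with transport: "udp")
The cloud returns sessionCreated, whose udp field holds the connection information
The device opens a UDP socket and records server, port, key, and nonce
The device sends audioStart and begins uploading encrypted audio frames
The cloud pushes TTS audio frames down the same UDP channel

UDP audio packets must be encrypted with AES-CTR. For the full packet format, encryption rules, and audio encoding requirements, see:

→ UDP Audio Channel Protocol Specification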

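7. AI and Thing Model Integration

During an AI conversation, if the user asks to control a device (for example "打开客厅的灯" (turn on the living-room light) or "调低温度到 22 度" (lower the temperature to 22 degrees)), the AI service carries out the control through a thing-model Action downlink rather than operating the hardware directly.

7.1 How It Works

User voice/text
    ↓
AI parses the intent and calls an MCP tool
    ↓
AI service generates a thingAction message
    ↓
dmsvr pushes it to the device through the thing-model action channel
    ↓
Device executes the action and reports the result through thing_action_reply
    ↓
AI service receives the result and generates the final reply text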

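7.2 Downlink Message Routing

The handleAiDownlinkMsg function handles downlink messages from the AI service and supports four kinds:

kind	Description	Push channel
`thingEvent`	AI triggers a device event	`$thing/down/thing/{ProductID}/{DeviceName}` (event)
`thingAction`	AI invokes a device action	`$thing/down/thing/{ProductID}/{DeviceName}` (action)
`thingActionResp`	AI returns an action execution result	`$thing/down/thing/{ProductID}/{DeviceName}` (action)
`ai_msg`	AI protocol downlink message	`$thing/down/ai/{ProductID}/{DeviceName}`

The device receives thingAction on the standard thing-model topic and, once it has finished, reports the result through $thing/up/thing/.../action.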

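7.3 Device-side Recommendations

A device implementation needs to subscribe to both of the following topics:

$thing/down/ai/{ProductID}/{DeviceName}: AI protocol messages (session management, text, TTS state)
$thing/down/thing/{ProductID}/{DeviceName}: thing-model downlink (device control initiated by the AI)

For thing-model protocol details, see: Thing Model Protocol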

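8. Interaction Flows

8.1 Text-only Chat

8.2 Real-time Voice Chat

audioStop graceful-stop path: after the device sends audioStop, the server waits for the final ASR text (up to 1.5 seconds) and then runs the full LLM → TTS reply flow. If the wait times out with no recognition result (for example, the input was silence), the session quietly ends the round without sending respSttDone/respTextDone.
VAD auto-trigger path: when the server detects a speech/silence boundary it ends the current input round automatically, and the device does not need to send audioStop. Afterwards the device may keep sending silent frames on the UDP channel (the server ignores them) or stop sending. The LLM → TTS flow after a VAD trigger is identical to the one after a manual audioStop.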
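
Since a silent round produces no respSttDone, a device should not wait for it indefinitely after audioStop. The sketch below bounds the wait with a timer. awaitSttDone and errSilentRound are hypothetical names, and the wait budget (the server's 1.5 seconds plus some network margin) is an assumption the caller has to choose.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errSilentRound signals that the round ended without a transcript.
var errSilentRound = errors.New("no respSttDone: round ended silently")

// awaitSttDone waits for the final transcript after audioStop.
// sttDone is fed by the MQTT handler whenever respSttDone arrives.
func awaitSttDone(sttDone <-chan string, wait time.Duration) (string, error) {
	timer := time.NewTimer(wait)
	defer timer.Stop()
	select {
	case text := <-sttDone:
		return text, nil
	case <-timer.C:
		return "", errSilentRound
	}
}

func main() {
	sttDone := make(chan string, 1)

	// Round with speech: the transcript is already queued.
	sttDone <- "今天天气怎么样"
	fmt.Println(awaitSttDone(sttDone, 3*time.Second))

	// Silent round: nothing arrives, so the device goes back to listening.
	fmt.Println(awaitSttDone(sttDone, 50*time.Millisecond))
}
```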

8.3 Multimodal Input (Text + Image)

8.4 User Interrupts the AI Reply

8.5 AI-driven Device Control

8.6 Complete Device AI Interaction Lifecycle

┌───────────────┐     ┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│ Create product│────>│  Bind Agent   │────>│  Bind device  │────>│  Auto-create  │
│ Product       │     │ Product.      │     │ DeviceInfo    │     │ Clone         │
│               │     │ defaultAgent  │     │   .bind       │     │               │
└───────────────┘     └───────────────┘     └───────────────┘     └───────────────┘
                                                                          │
                                                                          ▼
┌───────────────┐     ┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│  AI response  │<────│   LLM/TTS     │<────│   aicenter    │<────│     MQTT      │
│ MQTT downlink │     │  processing   │     │   RPC call    │     │ sessionCreate │
└───────────────┘     └───────────────┘     └───────────────┘     └───────────────┘

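9. Error Codes

Error code	Description
`invalidInput`	Invalid input (missing parameter, malformed value)
`modelOverloaded`	Model overloaded
`modelTimeout`	Model timed out
`sessionExpired`	Session expired (send sessionCreate again)
`quotaExceeded`	Quota exceeded
`internalError`	Internal error
`unsupportedModality`	Unsupported modality

Recommended handling:

Error code	Device-side handling
`invalidInput`	Check the parameter format, then retry
`sessionExpired`	Send `sessionCreate` again to create a new session
`modelOverloaded`	Wait 1-3 seconds, then retry
`modelTimeout`	Wait 3-5 seconds, then retry, or notify the user
`quotaExceeded`	Stop sending requests and wait for the quota to recover
`internalError`	Wait, then retry; report if it keeps failing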
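
The handling table can be encoded as a small policy function. This is one possible reading, with several assumptions: the recovery type and action names are hypothetical, the delays grow with the attempt number inside the documented ranges, internalError is reported after three attempts, and codes not covered by the table (such as unsupportedModality) are simply reported.

```go
package main

import (
	"fmt"
	"time"
)

// recovery describes what the device should do after an error message.
type recovery struct {
	Action string        // fixRequest, recreateSession, retry, stop, or report
	Delay  time.Duration // how long to wait before acting
}

// maxInternalRetries is an assumption; the table only says "report if it keeps failing".
const maxInternalRetries = 3

func clamp(d, lo, hi time.Duration) time.Duration {
	if d < lo {
		return lo
	}
	if d > hi {
		return hi
	}
	return d
}

// recoveryFor maps an error code and a 1-based attempt number to an action.
func recoveryFor(code string, attempt int) recovery {
	switch code {
	case "invalidInput":
		return recovery{Action: "fixRequest"}
	case "sessionExpired":
		return recovery{Action: "recreateSession"}
	case "modelOverloaded": // documented range: 1-3 seconds
		return recovery{Action: "retry", Delay: clamp(time.Duration(attempt)*time.Second, time.Second, 3*time.Second)}
	case "modelTimeout": // documented range: 3-5 seconds
		return recovery{Action: "retry", Delay: clamp(time.Duration(attempt+2)*time.Second, 3*time.Second, 5*time.Second)}
	case "quotaExceeded":
		return recovery{Action: "stop"}
	case "internalError":
		if attempt >= maxInternalRetries {
			return recovery{Action: "report"}
		}
		return recovery{Action: "retry", Delay: 2 * time.Second} // delay is an assumption
	default:
		return recovery{Action: "report"}
	}
}

func main() {
	for _, code := range []string{"modelOverloaded", "modelTimeout", "sessionExpired"} {
		fmt.Printf("%s -> %+v\n", code, recoveryFor(code, 1))
	}
}
```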

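10. Go Code Examples

The examples below come from the integration test backend/things/test/mqtt_ai_realdev_test.go and show a complete device-side AI interaction flow.

10.1 MQTT Connection (Device Authentication)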

import (
    "log"
    "time"

    "gitee.com/unitedrhino/things/share/domain/deviceAuth"
    mqtt "github.com/eclipse/paho.mqtt.golang"
)

// Generate the device credentials with HMAC-SHA256.
clientID, userName, pwd := deviceAuth.GenSecretDeviceInfo(
    deviceAuth.HmacSha256, productID, deviceName, deviceSecret)

opts := mqtt.NewClientOptions()
opts.AddBroker("tcp://your-server:1883")
opts.SetClientID(clientID)
opts.SetUsername(userName)
opts.SetPassword(pwd)
opts.SetAutoReconnect(false)
opts.SetConnectTimeout(10 * time.Second)

mc := mqtt.NewClient(opts)
token := mc.Connect()
// WaitTimeout returns false on timeout; token.Error() reports a failed connect.
if !token.WaitTimeout(10*time.Second) || token.Error() != nil {
    log.Fatalf("MQTT connect failed: %v", token.Error())
}
defer mc.Disconnect(500)

10.2 Subscribing to Downlink Messages

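type AiMsg struct {
    MsgToken string         `json:"msgToken"`
    Method   string         `json:"method"`
    Code     int            `json:"code,omitempty"`
    Params   any            `json:"params,omitempty"`
    Data     map[string]any `json:"data,omitempty"`
}

msgCh := make(chan AiMsg, 30)
downTopic := fmt.Sprintf("$thing/down/ai/%s/%s", productID, deviceName)

// Decode every downlink payload and queue it; malformed payloads are dropped.
onMessage := func(_ mqtt.Client, m mqtt.Message) {
    var msg AiMsg
    if err := json.Unmarshal(m.Payload(), &msg); err != nil {
        log.Printf("drop malformed AI message: %v", err)
        return
    }
    msgCh <- msg
}
if t := mc.Subscribe(downTopic, 1, onMessage); t.Wait() && t.Error() != nil {
    log.Fatalf("subscribe %s failed: %v", downTopic, t.Error())
}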

10.3 Complete Text Chat Flow

upTopic := fmt.Sprintf("$thing/up/ai/%s/%s", productID, deviceName)

// Helper that publishes one uplink message.
publish := func(method string, params any) {
    msg := AiMsg{
        MsgToken: fmt.Sprintf("msg-%d", time.Now().UnixNano()),
        Method:   method,
        Params:   params,
    }
    payload, err := json.Marshal(msg)
    if err != nil {
        log.Fatalf("marshal %s: %v", method, err)
    }
    if t := mc.Publish(upTopic, 1, false, payload); t.Wait() && t.Error() != nil {
        log.Fatalf("publish %s: %v", method, t.Error())
    }
}

// Wait for a downlink message with the given method, skipping intermediate
// messages such as respTextDelta. A downlink error message ends the wait early.
waitMethod := func(ctx context.Context, method string) (AiMsg, error) {
    for {
        select {
        case msg := <-msgCh:
            if msg.Method == method {
                return msg, nil
            }
            if msg.Method == "error" {
                return msg, fmt.Errorf("server error while waiting for %s: %v", method, msg.Data)
            }
        case <-ctx.Done():
            return AiMsg{}, fmt.Errorf("timed out waiting for %s: %w", method, ctx.Err())
        }
    }
}

// Step 1: create a session (text-only mode).
publish("sessionCreate", map[string]any{
    "modalities": []string{"text"},
})

ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
defer cancel()
msg, err := waitMethod(ctx, "sessionCreated")
if err != nil {
    log.Fatal(err)
}
sessionID, _ := msg.Data["sessionId"].(string)
fmt.Println("session created:", sessionID)

// Step 2: send text input.
time.Sleep(500 * time.Millisecond)
publish("inputSend", map[string]any{
    "contents": []map[string]any{
        {"type": "text", "text": "物联网是什么？"},
    },
})

// Step 3: wait for the AI reply (respTextDone carries the complete text).
ctx2, cancel2 := context.WithTimeout(context.Background(), 60*time.Second)
defer cancel2()
resp, err := waitMethod(ctx2, "respTextDone")
if err != nil {
    log.Fatal(err)
}
aiText, _ := resp.Data["text"].(string)
fmt.Println("AI reply:", aiText)
// Sample output: 物联网（Internet of Things，简称IoT），是把世间万物通过互联网连接起来...

// Step 4: close the session.
publish("sessionClose", nil)
ctx3, cancel3 := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel3()
if _, err := waitMethod(ctx3, "sessionClosed"); err != nil {
    log.Printf("sessionClosed not received: %v", err)
}

10.4 Voice Chat (UDP Channel)

// Step 1: create a session that includes audio.
publish("sessionCreate", map[string]any{
    "modalities": []string{"text", "audio"},
    "transport":  "udp",
})

ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
defer cancel()
msg, err := waitMethod(ctx, "sessionCreated")
if err != nil {
    log.Fatal(err)
}

// Step 2: extract the UDP parameters.
udpInfo, ok := msg.Data["udp"].(map[string]any)
if !ok {
    log.Fatal("sessionCreated carries no udp field")
}
udpServer := udpInfo["server"].(string)          // server address (decided by server configuration)
udpPort   := int(udpInfo["port"].(float64))      // UDP port, default 8884
udpKey    := udpInfo["key"].(string)             // Base64 AES key
udpNonce  := udpInfo["nonce"].(string)           // Base64 nonce
fmt.Printf("UDP connection info: %s:%d\n", udpServer, udpPort)

// Step 3: open the UDP connection and send audio frames (uses the UdpAudioClient in udp_audio.go).
// See backend/things/test/udp_audio.go
udpClient, err := NewTestUDPClient(udpServer, udpPort, udpKey, udpNonce)
if err != nil {
    log.Fatal(err)
}
defer udpClient.Close()

// Tell the server to start receiving.
publish("audioStart", nil)

// Send Opus audio frames (silent frames in this example).
udpClient.SendSilenceFrames(20, 60)  // 20 frames × 60 ms = 1.2 s

publish("audioStop", nil)

// Step 4: send text input, or wait for VAD to trigger.
publish("inputSend", map[string]any{
    "contents": []map[string]any{
        {"type": "text", "text": "测试语音通道，请确认收到"},
    },
})

ctx2, cancel2 := context.WithTimeout(context.Background(), 60*time.Second)
defer cancel2()
resp, err := waitMethod(ctx2, "respTextDone")
if err != nil {
    log.Fatal(err)
}
fmt.Println("AI reply:", resp.Data["text"])

Full runnable code: backend/things/test/mqtt_ai_realdev_test.go

cd backend/things
TEST_BASE_URL=http://your-server:7777 TEST_MQTT_BROKER=tcp://your-server:1883 \
  go test -v -run "TestMQTTRealDevTextChat|TestMQTTRealDevVoiceChat" ./test/... -timeout 180s

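10.5 Prerequisites: Creating the Product/Device and Binding an Agent

The complete flow for creating a product and a device and binding an Agent by calling the HTTP API from Go: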

// 1. Create an Agent (returns agentID).
agentResp := httpPost("/api/v1/ai/agent/info/create", map[string]any{
    "name":         "我的设备智能体",
    "capabilities": []string{"text", "voice"},
})
agentID := agentResp["id"].(float64)

// 2. Create a product (bound to defaultAgentID).
// bindLevel values:
//   1 = strong binding (the device must be online and not bound by another user)
//   2 = medium binding (the device need not be online; repeated binding is not allowed)
//   3 = weak binding (the device need not be online; repeated binding is allowed, suits home/multi-user scenarios)
httpPost("/api/v1/things/product/info/create", map[string]any{
    "productID":      "my_product_001",
    "productName":    "智能设备",
    "deviceType":     1,   // directly connected device
    "authMode":       1,   // secret-based authentication
    "defaultAgentId": agentID,
    "bindLevel":      3,   // weak binding
})

// 3. Create a device.
httpPost("/api/v1/things/device/info/create", map[string]any{
    "productID":  "my_product_001",
    "deviceName": "device_001",
    "projectID":  "2",
})

// 4. Bind the device (triggers automatic Clone creation; the device gets its own AI memory).
httpPost("/api/v1/things/device/info/bind", map[string]any{
    "device": map[string]any{
        "productID":  "my_product_001",
        "deviceName": "device_001",
    },
})

// 5. Fetch the device secret (used for MQTT authentication).
infoResp := httpPost("/api/v1/things/device/info/get-one", map[string]any{
    "productID":  "my_product_001",
    "deviceName": "device_001",
})
deviceSecret := infoResp["secret"].(string)
cloneID      := infoResp["cloneID"].(string)  // cloneID is returned as a string (avoids precision loss in JS)
// A cloneID that is non-empty and not "0" means the Clone was created automatically and the device has its own AI memory.

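11. Version Notes

The 联犀 AI interaction protocol uses lowerCamelCase method names (sessionCreate, inputSend, audioStart, and so on) and travels over the dedicated ai topic type, fully separate from the thing-model property/event/action channels.

Changelog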

2026/4/16 23:07


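4ef2c - docs(ai交互): sync the session-close semantics of the spoken exit intent (2026/4/16)
09d68 - docs(ai): sync interaction and udp protocol notes (2026/4/9)
2e886 - docs: refresh ai interaction notes (2026/4/8)
e70e5 - fix(ai-interaction): remove the agentId parameter, change response code 0→200, switch the kind field to lowerCamelCase (2026/3/18)
8d78e - docs(ai): clarify ambiguous fields and behaviors in the AI interaction protocol (2026/3/18)
a7760 - docs(ai): expand and complete the AI interaction protocol document (2026/3/18)
69dce - docs: add the device access / 联犀 protocol / AI interaction document (2026/3/12)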