Adapters

Note

rlmesh.adapters is experimental: it may change or disappear. Pin versions; see Compatibility.

rlmesh.adapters derives a model-to-environment IO adapter at runtime from two declarations: an environment tags its observation and action spaces, a model specifies the payload it ingests, and resolve() matches them by role. This replaces most of the per-(model, environment) adapter code you would otherwise write by hand; cases the declarative specs do not cover fall back to an escape hatch (see Known limitations).

The two sides of an evaluation connect through the adapter: an environment publishes tags, a model declares a spec, and resolve() bridges them.

rlmesh.adapters is opt-in. Nothing in it is imported by the core Gymnasium loop, and it requires the NumPy backend (pip install "rlmesh[numpy]").

Tag the environment

An environment tags its observation and action spaces. Tags are sparse: they carry each entry’s semantic role plus the few facts the Gymnasium spaces cannot express, such as image axis layout or rotation encoding. Keys, widths, dtypes, and bounds are read from the spaces.

import rlmesh.adapters as adapt

tags = adapt.EnvTags(
    observation={
        "wrist_rgb": adapt.ImageTag(role=adapt.IMAGE_PRIMARY),
        "ee_pos": adapt.StateTag(role=adapt.EEF_POS),
        "ee_quat": adapt.StateTag(role=adapt.EEF_ROT, encoding="quat_xyzw"),
        "grip": adapt.StateTag(role=adapt.GRIPPER_POS),
        "goal": adapt.TextTag(),
    },
    action=adapt.ActionLayout(
        adapt.ActionComponent(adapt.ACTION_DELTA_POS, dim=3),
        adapt.ActionComponent(adapt.ACTION_DELTA_ROT, dim=3, encoding="axis_angle"),
        adapt.ActionComponent(adapt.ACTION_GRIPPER, dim=1, range=(-1.0, 1.0)),
        clip=(-1.0, 1.0),
    ),
)

The observation map is keyed by observation path; dotted keys ("agent.eef_pos") traverse nested Dict spaces. Roles are an open vocabulary of strings matched verbatim between tags and specs. RLMesh ships well-known conventions (IMAGE_PRIMARY, EEF_POS, EEF_ROT, …), but any agreed string works.
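As a rough illustration of the two lookup rules above, the sketch below walks a dotted key through a nested observation and matches a role string verbatim. It uses plain dicts and strings only; get_by_path and find_path_for_role are hypothetical helpers, not part of rlmesh.adapters, and the role strings are placeholders.

```python
def get_by_path(obs, dotted_key):
    """Follow a dotted key such as "agent.eef_pos" through nested dicts."""
    node = obs
    for part in dotted_key.split("."):
        node = node[part]
    return node


def find_path_for_role(role_by_path, wanted_role):
    """Return the observation path tagged with wanted_role (exact string match)."""
    for path, role in role_by_path.items():
        if role == wanted_role:
            return path
    raise LookupError(f"no observation entry is tagged with role {wanted_role!r}")


# Hypothetical tag table: observation path -> role string.
role_by_path = {"agent.eef_pos": "eef_pos", "agent.grip": "gripper_pos"}
obs = {"agent": {"eef_pos": [0.1, 0.2, 0.3], "grip": [0.0]}}

path = find_path_for_role(role_by_path, "eef_pos")
eef_pos = get_by_path(obs, path)
```

Because the match is verbatim, a role spelled differently on the two sides (for example with different casing) simply does not match, which is why both sides need to agree on the exact string.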

Flat (non-Dict) observations

Some environments expose a single flat numeric vector with fixed index ranges instead of one key per quantity (Metaworld is the common case). A StateLayout tags that vector. It is the observation-side mirror of ActionLayout: an ordered sequence of StateField slices, each naming its role and width, with offsets implied by the order. A field with no role is a skip: it advances the offset past indices the model does not read.

"proprio": adapt.StateLayout(
    adapt.StateField(adapt.EEF_POS, 3),
    adapt.StateField(adapt.EEF_ROT, 4, encoding="quat_xyzw"),
    adapt.StateField(adapt.GRIPPER_POS, 1),
    adapt.StateField(dim=10),  # object/goal indices the policy reads from pixels
),

When the whole observation is one leaf, pass the StateLayout directly as observation:

adapt.EnvTags(observation=adapt.StateLayout(...), action=adapt.ActionLayout(...))

A model matches purely by role, so the same spec resolves against a flat env and a Dict env with no change.
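The offset rule for a flat layout can be sketched in a few lines. This is an illustration of the idea only, not the library's implementation: field_slices is a hypothetical helper, fields are plain (role, dim) pairs, and the role strings are placeholders.

```python
def field_slices(fields):
    """Map each role to its slice of the flat vector.

    fields is a sequence of (role, dim) pairs in vector order; a role of
    None marks a skipped span that still advances the offset.
    """
    slices = {}
    offset = 0
    for role, dim in fields:
        if role is not None:
            slices[role] = slice(offset, offset + dim)
        offset += dim
    return slices, offset


# Same shape as the "proprio" layout above, with placeholder role strings.
fields = [("eef_pos", 3), ("eef_rot", 4), ("gripper_pos", 1), (None, 10)]
slices, total_dim = field_slices(fields)

flat_obs = list(range(18))  # stand-in for an 18-dim observation vector
gripper = flat_obs[slices["gripper_pos"]]
```

Once roles are mapped to slices, a consumer that looks values up by role does not need to know whether they came from a flat vector or from separate Dict keys.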

Specify the model

A model fully specifies the payload it ingests and the action it emits, in its own conventions.

spec = adapt.ModelSpec(
    inputs=(
        adapt.ImageInput("image", role=adapt.IMAGE_PRIMARY, height=224, width=224),
        adapt.StateInput(
            "proprio",
            components=(
                adapt.StateComponent(adapt.EEF_POS),
                adapt.StateComponent(adapt.EEF_ROT, encoding="rot6d"),
                adapt.StateComponent(adapt.GRIPPER_POS),
            ),
            container="list",
        ),
        adapt.TextInput("task"),
    ),
    action=adapt.ActionLayout(
        adapt.ActionComponent(adapt.ACTION_DELTA_POS, dim=3),
        adapt.ActionComponent(adapt.ACTION_DELTA_ROT, dim=6, encoding="rot6d"),
        adapt.ActionComponent(adapt.ACTION_GRIPPER, dim=1, range=(-1.0, 1.0)),
    ),
)

Resolve and apply

resolve() matches the model spec against the tags and the spaces and returns an Adapter. The adapter preprocesses an observation into the model’s input format and postprocesses the model’s action back into the environment’s.

adapter = adapt.resolve(tags, env.observation_space, env.action_space, spec)
print(adapter.describe())  # the exact transformations chosen
payload = adapter.transform_obs(obs)  # env observation -> model input
action = adapter.transform_action(output)  # model output    -> env action

describe() prints what the resolver derived. Here the image is resized, the rotation goes quat_xyzw -> rot6d, the instruction key is remapped (goal -> task), and the 10-dim action is converted rot6d -> axis_angle, sliced, and clipped into the env’s 7-dim action. Resolution fails with an AdapterResolutionError if a model input or action component has no usable counterpart.
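To make the action path concrete, here is a minimal pure-Python sketch of the 10-dim to 7-dim conversion described above. It rests on assumptions that are not confirmed by this page: rot6d is taken to be the first two columns of the rotation matrix, axis_angle is taken to be a rotation vector (unit axis scaled by the angle), and the helper names are hypothetical. The conventions rlmesh actually uses are documented in the Adapters reference.

```python
import math


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def rot6d_to_axis_angle(rot6d):
    """Convert a 6-dim rotation (assumed: first two matrix columns) to a rotation vector."""
    a1, a2 = rot6d[:3], rot6d[3:]
    # Gram-Schmidt: rebuild an orthonormal basis from the two columns.
    b1 = _normalize(a1)
    dot = sum(x * y for x, y in zip(b1, a2))
    b2 = _normalize([y - dot * x for x, y in zip(b1, a2)])
    b3 = _cross(b1, b2)
    # Rotation angle from the trace of R = [b1 b2 b3].
    cos_angle = max(-1.0, min(1.0, (b1[0] + b2[1] + b3[2] - 1.0) / 2.0))
    angle = math.acos(cos_angle)
    if angle < 1e-8:
        return [0.0, 0.0, 0.0]
    # Axis from the antisymmetric part of R; not robust near angle == pi.
    scale = angle / (2.0 * math.sin(angle))
    return [
        scale * (b2[2] - b3[1]),
        scale * (b3[0] - b1[2]),
        scale * (b1[1] - b2[0]),
    ]


def postprocess_action(model_action, clip=(-1.0, 1.0)):
    """Map a 10-dim model action (3 + 6 + 1) to a 7-dim env action (3 + 3 + 1)."""
    if len(model_action) != 10:
        raise ValueError("expected a 10-dim action")
    delta_pos = list(model_action[0:3])
    delta_rot = rot6d_to_axis_angle(list(model_action[3:9]))
    gripper = list(model_action[9:10])
    low, high = clip
    return [min(high, max(low, x)) for x in delta_pos + delta_rot + gripper]


# Identity rotation in rot6d is the first two columns of the identity matrix.
env_action = postprocess_action([0.5, -2.0, 0.0, 1, 0, 0, 0, 1, 0, 1.0])
```

The widths follow from the two layouts: the model emits 3 + 6 + 1 values, and the environment expects 3 + 3 + 1, so only the rotation component changes size.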

Warning

Specs are pure data. Nothing in a tag or spec is ever evaluated as code. The one exception is EntrypointCustomInput, which imports a named module:callable only when you pass resolve(..., trust_entrypoints=True).
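The opt-in behavior can be pictured with a small, generic loader. This is a sketch of the pattern using only the standard library; load_entrypoint is a hypothetical function, and how rlmesh resolves an EntrypointCustomInput internally is not shown on this page.

```python
import importlib


def load_entrypoint(entrypoint, trust_entrypoints=False):
    """Import "module:callable" only when the caller has opted in."""
    if not trust_entrypoints:
        raise PermissionError(
            f"refusing to import {entrypoint!r}; pass trust_entrypoints=True to allow it"
        )
    module_name, _, attr = entrypoint.partition(":")
    if not module_name or not attr:
        raise ValueError(f"expected 'module:callable', got {entrypoint!r}")
    target = getattr(importlib.import_module(module_name), attr)
    if not callable(target):
        raise TypeError(f"{entrypoint!r} does not name a callable")
    return target


# A standard-library callable stands in for a user-supplied entrypoint.
sqrt = load_entrypoint("math:sqrt", trust_entrypoints=True)
```

Keeping the import behind an explicit flag means a spec received from an untrusted source cannot trigger code execution unless the caller asks for it.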

Run a model with no glue

The shortest path publishes the tags on the served environment and lets the model resolve the adapter from the contract.

import rlmesh

# Publish the tags alongside the environment's contract.
server = rlmesh.EnvServer(env, "127.0.0.1:5555", tags=tags)
server.serve()

EnvServer(tags=...) validates the tags against the environment’s spaces and merges them into the contract metadata (the tag() verb does the same for an environment you serve yourself). A model then resolves from the handshake alone.

from rlmesh.numpy import Model, RemoteEnv

env = RemoteEnv("127.0.0.1:5555")
model = Model(predict_fn, spec=spec)  # predict_fn works in the model's own format
model.run(env, max_episodes=10)

run(env) reads the environment’s contract, resolves the adapter, and wraps predict_fn so it only ever sees the model’s declared payload. To resolve explicitly, use resolve_from_contract() and adapter.wrap_predict(predict_fn).
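Conceptually, wrapping a prediction function composes the two adapter directions around it. The sketch below shows that composition with toy stand-ins; this wrap_predict is a hypothetical free function written for illustration, and it is not the signature of adapter.wrap_predict.

```python
def wrap_predict(predict_fn, transform_obs, transform_action):
    """Return a predict function that works on raw env observations."""

    def wrapped(obs):
        payload = transform_obs(obs)      # env observation -> model input
        output = predict_fn(payload)      # model sees only its declared payload
        return transform_action(output)   # model output -> env action

    return wrapped


# Toy stand-ins for the two adapter directions and for the model.
def to_payload(obs):
    return {"task": obs["goal"]}  # remap the instruction key goal -> task


def to_env_action(output):
    return [max(-1.0, min(1.0, x)) for x in output]  # clip into the env range


def predict_fn(payload):
    return [2.0, -0.5] if payload["task"] == "open the drawer" else [0.0, 0.0]


step = wrap_predict(predict_fn, to_payload, to_env_action)
action = step({"goal": "open the drawer"})
```

The prediction function never touches the environment's key names or action bounds, which is the property the wrapped form is meant to provide.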

Frame history

A model that conditions on a short history of frames declares stack=N on an image input. The adapter buffers the last N processed frames host-side and emits them on a new leading axis ((N, H, W, C)), padding the start of an episode with the first frame and clearing the buffer on reset.

adapt.ImageInput("image", role=adapt.IMAGE_PRIMARY, size=224, stack=4)

The environment still sends one frame per step; nothing extra crosses the wire.
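The buffering behavior can be sketched with a fixed-length deque. This is an illustration only: FrameStack is a hypothetical class, frames are opaque objects here, and the oldest-first ordering is an assumption. With array frames, stacking the returned list on a new leading axis would give the (N, H, W, C) shape mentioned above.

```python
from collections import deque


class FrameStack:
    """Keep the last n frames; pad with the first frame after a reset."""

    def __init__(self, n):
        self.n = n
        self._frames = deque(maxlen=n)

    def reset(self):
        self._frames.clear()

    def push(self, frame):
        if not self._frames:
            # Start of an episode: fill the whole window with the first frame.
            self._frames.extend([frame] * self.n)
        else:
            self._frames.append(frame)  # the oldest frame drops out
        return list(self._frames)  # length n, oldest first


stack = FrameStack(4)
first = stack.push("f0")
second = stack.push("f1")
stack.reset()
after_reset = stack.push("g0")
```

The buffer lives entirely on the host running the model, which is why the environment can keep sending a single frame per step.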

Caution

Frame stacking is host-side state. A spec that sets stack round-trips through to_json, but native resolution ignores the field: stacking happens in the adapter, not in the core.

Escape hatches

When a pairing needs logic a declarative spec cannot express, three mechanisms compose, most local first.

| Mechanism | Use |
| --- | --- |
| InlineCustomInput / EntrypointCustomInput | Compute one payload key from the raw observation; the rest stays spec-driven. |
| AdapterBase subclass | Add stateful behavior a spec cannot describe (for example temporal ensembling), usually by wrapping a resolved adapter. |
| Pair override | Replace the adapter for one (model, environment) pairing entirely. No special machinery: keep a registry keyed by the pair and consult it before resolving. |

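from collections.abc import Callable

# Registry of hand-written adapters, keyed by (model name, environment name).
OVERRIDES: dict[tuple[str, str], Callable[[], adapt.AdapterBase]] = {
    ("xvla", "simpler-bridge"): XVLABridgeAdapter,
}

def build_adapter(model_name, env_name, *resolve_args, **resolve_kwargs):
    # Prefer a registered override; otherwise fall back to declarative resolution.
    if (factory := OVERRIDES.get((model_name, env_name))) is not None:
        return factory()
    return adapt.resolve(*resolve_args, **resolve_kwargs)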

The examples/python/vla_adapters example shows all three over several VLA models and environments; examples/python/adapters is the smallest end-to-end serve-and-run loop.
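For the middle mechanism, the wrapping pattern might look like the sketch below. It is duck-typed rather than an AdapterBase subclass, because the AdapterBase interface is not shown on this page, and it uses simple exponential smoothing as a stand-in for temporal ensembling. SmoothedActions, Passthrough, and alpha are all hypothetical names.

```python
class Passthrough:
    """Stand-in for a resolved adapter: both directions are the identity."""

    def transform_obs(self, obs):
        return obs

    def transform_action(self, output):
        return list(output)


class SmoothedActions:
    """Wrap an adapter-like object and smooth its env actions over time."""

    def __init__(self, inner, alpha=0.5):
        self.inner = inner
        self.alpha = alpha  # weight of the newest action
        self._previous = None

    def reset(self):
        self._previous = None  # forget history at episode boundaries

    def transform_obs(self, obs):
        return self.inner.transform_obs(obs)

    def transform_action(self, output):
        action = self.inner.transform_action(output)
        if self._previous is not None:
            action = [
                self.alpha * new + (1.0 - self.alpha) * old
                for new, old in zip(action, self._previous)
            ]
        self._previous = action
        return action


smoothed = SmoothedActions(Passthrough(), alpha=0.5)
a1 = smoothed.transform_action([1.0, 0.0])
a2 = smoothed.transform_action([0.0, 0.0])
smoothed.reset()
a3 = smoothed.transform_action([0.0, 1.0])
```

The declarative part stays in the wrapped adapter; only the stateful step that a spec cannot describe lives in the wrapper.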

Custom encodings

Rotation encodings are a closed vocabulary, because a spec must resolve on a remote client with no code. For a general, stable convention (for example, a published model’s rot6d_rowmajor), add it first-party on the native RotationEncoding enum so it serializes into the contract and is conformance-tested. For a one-off, declare a CustomEncoding on the nearest base encoding and supply host-side repacking; move it first-party once you want it matched by role and reused. The Adapters reference covers CustomEncoding, the from_base/to_base boundary, and the resolve-time invariants.

Known limitations

The system targets the manipulation/VLA case: RGB cameras, proprioception, and an instruction. A few things are out of scope for now and fall back to an escape hatch.

| Area | Status |
| --- | --- |
| Modalities beyond image / state / text | Depth, lidar, and point clouds are not first-class; carry them through an InlineCustomInput or a custom AdapterBase. |
| Tokenization | Stays in the model. TextInput delivers the instruction as a string; tokenize it inside your prediction function. There is intentionally no TokenizerInput. |
| Rotation encodings | Fixed set: quat_xyzw, quat_wxyz, axis_angle, rot6d, rot6d_rowmajor, euler_xyz. Conventions and how to add one are in the Adapters reference. |