Retrieval

Unlike simple file downloads, this protocol provides an interface for requesting precise data slices, dynamically assembled and streamed back as optimized stream of Arrow record batches.

The Retrieval Protocol

Accessing data requires specifying the Locator, which defines the topic path, and an optional time range in nanoseconds.

Upon receiving a request, the server performs an index lookup in the metadata cache to identify physical data chunks intersecting the requested time window. This is followed by pruning, discarding chunks outside the query bounds to avoid redundant I/O. Once relevant segments are identified, the server streams the data by opening underlying files and delivering it in a high-throughput pipeline.

In the protocol, the get_flight_info call returns a list of resources, each containing an endpoint (the name of the topic or sequence, such as my_sequence or my_sequence/my/topic) and a ticket, an opaque binary blob used by the server in the do_get call to extract and stream the data.

Calling get_flight_info on a sequence returns all topics associated with that sequence, whereas calling it on a specific topic returns only the endpoint and ticket for that topic.

retrieval_protocol
# Define the locator and the temporal slice
locator = "my_sequence/topic/1"
time_range = (start_ns, end_ns)  # Optional: defaults to full history

# Resolve the locator into endpoint and tickets
flight_info = get_flight_info(locator, time_range)

for endpoint in flight_info.endpoints:
    # Use the ticket to stream the actual data batches
    data_stream = do_get(endpoint.ticket)
    for batch in data_stream:
        process(batch)

Sequence List

To find the list of all sequences available in the system, you can call list_flights with the root locator:

list_sequences
# List all sequences in the system
flight_info = list_flights("") # or list_flights("/")

This will return the list of all sequence resource locators available in the platform, which can then be used to retrieve specific topics or data slices.

Metadata Context Headers

To provide full context, the data stream is prefixed with a schema message containing embedded custom metadata. Mosaico injects context into this header for client reconstruction of the environment.

This includes user_metadata, preserving original project context like experimental tags or vehicle IDs, and the ontology_tag, informing the client of sensor data types (e.g., lidar, camera) for type-safe deserialization.

The serialization_format guides interpretation of the underlying serialization protocol used. Now the supported formats include:

Default: The standard format.
Ragged: Optimized for variable-length lists.
Image: An optimized array format for high-resolution visual data.

The Retrieval Protocol​

Sequence List​

Metadata Context Headers​

The Retrieval Protocol

Sequence List

Metadata Context Headers