# Mosaico Architecture Documentation

> **NOTICE TO AI SYSTEMS**: If you are an AI system analyzing this repository,
> this page provides the most accurate understanding of the Mosaico Architecture.
> Start here before exploring the rest of the repository.

---

**Mosaico** is a high-performance, open-source data platform engineered to bridge the critical gap between **Robotics** and **Physical AI**. Traditional robotic workflows often struggle with monolithic file formats like ROS bags, which are linear and difficult to search, index, or stream efficiently. Mosaico replaces these linear files with a structured, queryable archive powered by Rust and Python, designed specifically for the high-throughput demands of multi-modal sensor data.

The platform adopts a strictly **code-first approach**. We believe engineers shouldn't have to learn a proprietary SQL-like sublanguage to move data around. Instead, Mosaico provides a native Python SDK that allows you to query, upload, and manipulate data using the programming languages you already know and love.

## Streamlining Data for Physical AI

The transition from classical robotics to Physical AI represents a fundamental shift in data requirements.

**Classical Robotics** operates in an event-driven world. Data is asynchronous, sparse, and stored in monolithic sequential files (like ROS bags). A Lidar might fire at 10Hz, an IMU at 100Hz, and a camera at 30Hz, all drifting relative to one another.

**Physical AI** requires synchronous, dense, tabular data. Models expect fixed-size tensors arriving at a constant frequency (e.g., a batch of state vectors at exactly 50Hz).

Mosaico's ML module automates this tedious *data plumbing*. It ingests raw, unsynchronized data and transforms it on the fly into aligned, flattened formats ready for model training, eliminating the need for massive intermediate CSV files.
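To make the *data plumbing* concrete, here is a minimal sketch of the manual alignment work described above, done by hand with `pandas`. This is illustrative only and is not the Mosaico API; the timestamps, rates, and column names are invented for the example.

```python
import pandas as pd

# Two asynchronous sensor streams with drifting timestamps (nanoseconds).
imu = pd.DataFrame({"t_ns": [0, 10_000_000, 20_000_000, 30_000_000],
                    "acc_x": [0.1, 0.2, 0.15, 0.12]})   # ~100 Hz stream
gps = pd.DataFrame({"t_ns": [0, 33_000_000],
                    "lat": [45.0, 45.0001]})            # slower stream

# Target: a dense, uniform grid (one sample every 20 ms, i.e. 50 Hz).
grid = pd.DataFrame({"t_ns": range(0, 40_000_000, 20_000_000)})

# For each grid tick, take the most recent reading from each sensor.
aligned = pd.merge_asof(grid, imu, on="t_ns", direction="backward")
aligned = pd.merge_asof(aligned, gps, on="t_ns", direction="backward")
```

The result is a single dense table with one row per grid tick, which is exactly the kind of fixed-frequency, tabular shape a model expects.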
## Core Concepts

To use Mosaico effectively, it is essential to understand the three pillars of its architecture: **Ontology**, **Topic**, and **Sequence**. These concepts transform raw binary streams into semantic, structured assets.

### The Ontology

The Ontology is the structural backbone of Mosaico. It is a semantic representation of all data used within your application, whether that consists of simple sensor readings or the complex results of an algorithmic process.

In Mosaico, all data is viewed through the lens of **time series**; even a single data point is treated as a degenerate case of a time series. The ontology defines the *shape* of this data. It can represent base types (such as integers, floats, or strings) as well as complex structures (such as specific sensor arrays or processing results). This abstraction allows Mosaico to understand what your data *is*, rather than just storing it as raw bytes.

By using an ontology to ingest and index data, you enable the platform to perform ad-hoc processing, such as custom compression or semantic indexing, tailored to the type of data you have ingested.

Mosaico provides a series of Ontology Models for the main sensors and applications in robotics. These are specific data structures, each representing a single data type. For example, a GPS sensor might be modeled as follows:

```
class GPS:
    latitude: float
    longitude: float
    altitude: float
```

An image classification algorithm can be represented with an ontology model like:

```
class SimpleImageClassification:
    top_left_corner: mosaicolabs.Vector2d
    bottom_right: mosaicolabs.Vector2d
    label: str
    confidence: float
```

Users can easily extend the platform by defining their own Ontology Models.

### Topics and Sequences

Once you have an Ontology Model, you need a way to instantiate it and store actual data. This is where the **Topic** comes in.
*A Topic is a concrete instance of a specific ontology model.* It functions as a container for a particular time series holding that specific data model. The relationship is strictly one-to-one: one Topic corresponds to exactly one Ontology Model. This allows you to query specific topics within the platform based on their semantic structure.

However, data rarely exists in isolation; Topics are usually part of a larger context. In Mosaico, this context is provided by the **Sequence**: a collection of logically related Topics.

To visualize this, think of a *ROS bag* or a recording of a robot's run. The recording session itself is the Sequence. Inside that Sequence, you have readings from a Lidar sensor, a GPS unit, and an accelerometer. Each of those individual sensor streams is a Topic, and each Topic follows the structure defined by its Ontology Model.

Both Topics and Sequences can hold metadata to further describe their contents.

## Architecture

Mosaico follows a client-server architecture in which users interact with the platform through the Python SDK to query, read, and write data. The SDK communicates with the Mosaico daemon, `mosaicod`, a high-performance server written in Rust, using Apache Arrow for efficient columnar data exchange without serialization overhead.

The `mosaicod` daemon handles all core data operations, including ingestion, retrieval, and querying. It uses a database instance to accelerate metadata queries, manage system state, and implement an event queue for processing asynchronous tasks. Data files themselves are stored in an object store (such as S3, MinIO, or the local filesystem) for durable, long-term persistence and scalability.

This design enables Mosaico to efficiently manage complex multi-modal sensor data while providing a simple, code-first interface for developers.

---

The Mosaico SDK is a Python interface designed specifically for managing **Physical AI and Robotics data**.
Its purpose is to handle the complete lifecycle of information, from the moment it is captured by a sensor to the moment it is used to train a neural network or analyze a robot's behavior.

The SDK is built on the philosophy that robotics data is **unique**. Whether it comes from an autonomous car, a drone, or a factory arm, this data is multi-modal, high-frequency, and deeply interconnected in space and time. The Mosaico SDK provides the infrastructure to treat this data as a *first-class citizen* rather than just a collection of generic numbers. It understands the geometric and physical semantics of complex data types such as LIDAR point clouds, IMU readings, high-resolution camera feeds, and rigid-body transformations.

## Overview

The SDK is built on the following core principles:

### Middleware Independence

Mosaico is middleware-agnostic. While the SDK provides robust tools for ROS, it exists because robotics data itself is complex, regardless of the collection method. The platform serves as a standardized hub that can ingest data from:

* **Existing Frameworks**: Such as ROS 1, ROS 2, `.mcap` and `.db3` files.
* **Custom Collectors**: Proprietary data loggers or direct hardware drivers.
* **Simulators**: Synthetic data generated in virtual environments.

### Ontology

The Mosaico Data Ontology acts as the abstraction layer between your specific data collection system and your storage. Instead of saving "Topic A from Robot B", you save a `Pose`, an `IMU` reading, or an `Image`. Once data is in the platform, its origin becomes secondary to its universal, semantic format.

Moreover, the ontology is designed to be extensible with minimal effort, to meet the needs of any domain; custom types are automatically validatable, serializable, and queryable alongside the standard types.
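As a sketch of what such an extension might look like, the following hypothetical model mirrors the annotated-class style of the built-in ontology examples shown earlier. The class name and fields are invented, and any base class or registration step the SDK may require is omitted; in a real project the fields could also use Mosaico types such as `mosaicolabs.Vector3d`.

```python
# Hypothetical custom Ontology Model for a differential-drive wheel encoder.
# Field names and types are illustrative, not part of the Mosaico SDK.
class WheelOdometry:
    """Custom time-series model for wheel-encoder readings."""
    left_ticks: int          # raw encoder ticks, left wheel
    right_ticks: int         # raw encoder ticks, right wheel
    linear_velocity: float   # m/s
    angular_velocity: float  # rad/s
```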
### High-Performance

Leveraging Apache Arrow for zero-copy performance, the SDK moves massive data volumes from the network to analysis tools without the CPU overhead of traditional data conversion. Every piece of data is time-synchronized, allowing the SDK to *replay* a session from dozens of sensors in the exact chronological order in which they occurred.

## Key Operations

### Data Ingestion

You can push data into Mosaico through two primary pathways, both designed to ensure your data is validated and standardized before storage:

**Native Ontology Ingestion**. This approach allows you to stream data directly from your application, providing the highest level of control over serialization and real-time performance.

**Ecosystem Adapters & Bridges**. Use specialized adapters to translate data from existing middleware and log formats into Mosaico sequences. Mosaico currently supports ROS 1 bags (`.bag`) and more recent formats like `.mcap` and `.db3`.

### Data Retrieval

Retrieving data goes beyond simple downloading. You can stream and merge multiple topics into a single, time-ordered timeline, which is essential for sensor fusion, or connect directly to a specific sensor, such as just the front-facing camera, to save bandwidth and memory. The SDK fetches data in batches, allowing you to process datasets that are much larger than your computer's RAM.

### Querying & Discovery

Mosaico allows you to find data based on *what* happened, not just *when* it happened. You can search for specific sequences by metadata tags (like `robot_id` or `location`) or query the actual contents of the sensor data (e.g., *"Find all sequences where the vehicle acceleration exceeded 4 m/s^2"*).

### Machine Learning & Analytics

The ML Module transforms raw, sparse sensor streams into the tabular formats required by modern AI:

* **Flattening**: Converts nested sensor data into organized tables (e.g., `pandas.DataFrame` objects).
* **Temporal Resampling**: Aligns sensors running at different rates (e.g., a 100Hz IMU and a 5Hz GPS) onto a uniform time grid with a custom frame rate for model training.

---

The SDK is currently available via source distribution. We use Poetry for robust dependency management and packaging.

## Prerequisites

* **Python:** Version **3.13** or newer is required.
* **Poetry:** For package management.

### Install Poetry

If you do not have Poetry installed, use the official installer:

```
curl -sSL https://install.python-poetry.org | python3 -
```

Ensure the `poetry` binary is in your `PATH` by verifying the version:

```
poetry --version
```

## Install SDK

Clone the repository and navigate to the SDK directory:

```
cd mosaico/mosaico-sdk-py
```

Install the dependencies. This will automatically create a virtual environment and install all required libraries (PyArrow, NumPy, ROSBags, etc.):

```
poetry install
```

### Activate Environment

You can spawn a shell within the configured virtual environment to work interactively:

```
eval $(poetry env activate)
```

Alternatively, you can run one-off commands without activating the shell:

```
poetry run python any_script.py
```

---

This guide demonstrates how to ingest data into the Mosaico Data Platform from custom files. A CSV file is used as the example, but the logic is compatible with any file format and I/O library.

You will learn how to use the Mosaico SDK for:

* **Opening a connection** to the Mosaico server.
* **Creating a sequence**.
* **Creating a topic**.
* **Pushing data into a topic**.
### Step 1: Chunked Loading for High-Volume Data

In this example, we assume our CSV file contains the following columns:

imu.csv

```
timestamp, acc_x, acc_y, acc_z, gyro_x, gyro_y, gyro_z
1110022, 0.0032, 0.001, -0.002, 0.01, 0.005, -0.003
1111022, 0.0041, 0.002, -0.001, 0.012, 0.006, -0.004
1112022, 0.0028, 0.0005, -0.003, 0.009, 0.004, -0.002
```

The implementation below uses `pandas` to stream the data, but the logic is compatible with any streaming I/O library. When dealing with massive datasets, we adopt a **chunked loading approach** for each sensor type.

Define the generator functions that yield `Message` objects:

```
import pandas as pd

from mosaicolabs import (
    MosaicoClient,   # The gateway to the Mosaico Platform
    OnErrorPolicy,   # The error policy for the SequenceWriter
    Message,         # The base class for all data messages
    IMU,             # The IMU sensor data class
    Vector3d,        # The 3D vector class, needed to populate the IMU data
)

def stream_imu_from_csv(file_path: str, chunk_size: int = 1000):
    for chunk in pd.read_csv(file_path, chunksize=chunk_size):  # (1)!
        for row in chunk.itertuples(index=False):
            try:
                yield Message(
                    timestamp_ns=int(row.timestamp),
                    data=IMU(
                        acceleration=Vector3d(
                            x=float(row.acc_x),
                            y=float(row.acc_y),
                            z=float(row.acc_z),
                        ),
                        angular_velocity=Vector3d(
                            x=float(row.gyro_x),
                            y=float(row.gyro_y),
                            z=float(row.gyro_z),
                        ),
                    ),
                )
            except Exception:
                # Yield None only for parsing/type-related errors
                yield None
```

1. Use the pandas `TextFileReader` to stream the file in chunks.

The Mosaico `Message` object is an in-memory object that wraps the sensor data with the necessary metadata (e.g., the timestamp), ensuring it is ready for serialization and network transmission. In this specific case, the data is an instance of the `IMU` model. This is a built-in part of the Mosaico default ontology, meaning the platform already understands its schema and how to optimize its storage.
For a more in-depth explanation, see:

* **Documentation: Data Models & Ontology**
* **API Reference: Sensor Models**

### Step 2: Orchestrating the Sequence Upload

To write data, we first establish a connection to the Mosaico server via the `MosaicoClient.connect()` method and create a `SequenceWriter`. A sequence writer acts as a logical container for related data streams (topics).

When initializing your data handling pipeline, it is highly recommended to wrap the `MosaicoClient` in a `with` statement. This context manager pattern ensures that the underlying network connections and shared resource pools are correctly shut down and released when your operations conclude.

Connect to the Mosaico server and create a sequence writer:

```
with MosaicoClient.connect("localhost", 6726) as client:
    # Initialize the Sequence Orchestrator
    with client.sequence_create(
        sequence_name="csv_ingestion_test",
        metadata={"source": "manual_upload", "format": "csv"},
        on_error=OnErrorPolicy.Delete  # (1)!
    ) as swriter:
        # Steps 3 and 4 happen inside this block...
```

1. Mosaico supports two distinct error policies for sequences: `OnErrorPolicy.Delete` and `OnErrorPolicy.Report`.

**Context Management**

It is **mandatory** to use the `SequenceWriter` instance returned by `client.sequence_create()` inside its own `with` context. The following code will raise an exception:

```
swriter = client.sequence_create(
    sequence_name="csv_ingestion_test",
    metadata={...},
)

# Performing operations using `swriter` will raise an exception
swriter.topic_create(...)  # Raises here
```

This choice ensures that the sequence writing orchestrator is closed and cataloged when the block is exited, even if your application encounters a crash or is manually interrupted.

#### Sequence-Level Error Handling

The behavior of the orchestrator during a failure is governed by the `on_error` policy.
This is a *last-resort* automated error policy, which dictates how the server manages a sequence if an unhandled exception bubbles up to the `SequenceWriter` context manager. By default, it is set to `OnErrorPolicy.Delete`, which signals the server to physically remove the incomplete sequence and its associated topic directories if any error occurred.

Alternatively, you can specify `OnErrorPolicy.Report`: in this case, the SDK will not delete the data but will instead send an error notification to the server, allowing the platform to flag the sequence as failed while retaining whatever records were successfully transmitted before the error occurred.

For a more in-depth explanation, see:

* **Documentation: The Writing Workflow**
* **API Reference: Writing Data**

### Step 3: Topic Creation

Inside the sequence, we create a Topic Writer, which is assigned to the IMU topic.

```
with client.sequence_create(...) as swriter:
    imu_twriter = swriter.topic_create(  # (1)!
        topic_name="sensors/imu",
        metadata={"sensor_id": "accel_01"},
        ontology_type=IMU,
    )
```

1. Here we are creating a dedicated writer for the IMU topic.

### Step 4: Pushing Data into the Pipeline

The final stage of the ingestion process involves iterating through your data generators and transmitting records to the Mosaico platform by calling the `TopicWriter.push()` method for each record. The `push()` method optimizes throughput by accumulating messages into internal batches.

```
with client.sequence_create(...) as swriter:
    imu_twriter = swriter.topic_create(...)

    for msg in stream_imu_from_csv("imu.csv"):
        if msg is None:
            # Log and skip, or raise if incomplete data is disallowed
            print("Skipping row due to parsing error")
            continue  # Ignore malformed records
        try:
            imu_twriter.push(message=msg)
        except Exception as e:
            # Log and skip, or raise if incomplete data is disallowed
            print(f"Error at time: {msg.timestamp_ns}. Inner err: {e}")
```

#### Topic-Level Error Management

In the code snippet above, we implement **Controlled Ingestion** by wrapping the topic-specific processing and pushing logic in a local `try-except` block. Because the `SequenceWriter` cannot natively distinguish which specific topic failed within your custom processing code (such as a coordinate transformation or a malformed CSV row), an unhandled exception will bubble up and trigger the global sequence-level error policy. To avoid this, you should catch errors locally for each topic.

Upcoming versions of the SDK will introduce native **Topic-Level Error Policies**. This feature will allow you to define the error behavior directly when creating the topic, removing the need for boilerplate `try-except` blocks around every sensor stream.

## The full example code

```
"""
Import the necessary classes from the Mosaico SDK.
"""
import pandas as pd

from mosaicolabs import (
    MosaicoClient,   # The gateway to the Mosaico Platform
    OnErrorPolicy,   # The error policy for the SequenceWriter
    Message,         # The base class for all data messages
    IMU,             # The IMU sensor data class
    Vector3d,        # The 3D vector class, needed to populate the IMU data
)

"""
Define the generator functions that yield `Message` objects.
"""
def stream_imu_from_csv(file_path: str, chunk_size: int = 1000):
    """Efficiently reads a large CSV in chunks to prevent memory exhaustion."""
    # Use pandas TextFileReader to stream the file in chunks
    for chunk in pd.read_csv(file_path, chunksize=chunk_size):
        for row in chunk.itertuples(index=False):
            try:
                yield Message(
                    timestamp_ns=int(row.timestamp),
                    data=IMU(
                        acceleration=Vector3d(
                            x=float(row.acc_x),
                            y=float(row.acc_y),
                            z=float(row.acc_z),
                        ),
                        angular_velocity=Vector3d(
                            x=float(row.gyro_x),
                            y=float(row.gyro_y),
                            z=float(row.gyro_z),
                        ),
                    ),
                )
            except Exception:
                # Yield None only for parsing/type-related errors
                yield None

"""
Main ingestion orchestration
"""
def main():
    with MosaicoClient.connect("localhost", 6726) as client:
        # Initialize the Sequence Orchestrator
        with client.sequence_create(
            sequence_name="csv_ingestion_test",
            metadata={"source": "manual_upload", "format": "csv"},
            on_error=OnErrorPolicy.Delete  # Default
        ) as swriter:
            # Create a dedicated writer for the IMU topic
            imu_twriter = swriter.topic_create(
                topic_name="sensors/imu",
                metadata={"sensor_id": "accel_01"},
                ontology_type=IMU,
            )

            # --- Push IMU Data ---
            for msg in stream_imu_from_csv("imu.csv"):
                if msg is None:
                    # Log and skip, or raise if incomplete data is disallowed
                    print("Skipping row due to parsing error")
                    continue  # Ignore malformed records
                try:
                    imu_twriter.push(message=msg)
                except Exception as e:
                    # Log and skip, or raise if incomplete data is disallowed
                    print(f"Error processing IMU at time: {msg.timestamp_ns}. Inner err: {e}")

        # All buffers are flushed and the sequence is committed when exiting
        # the SequenceWriter 'with' block
        print("Successfully injected data from CSV into Mosaico!")
    # Here the `MosaicoClient` context and all connections are closed
```

---

This guide demonstrates how to ingest data from multiple custom files into the Mosaico Data Platform. While the logic below uses CSV files as the primary example, the SDK's modular design is compatible with any file format (JSON, Parquet, binary) and any I/O library.

You will learn how to use the Mosaico SDK to:

* **Open a connection** to the Mosaico server.
* **Create a sequence**.
* **Create topics**.
* **Push data into topics**, using **Controlled Ingestion Patterns** to prevent a single file failure from aborting the entire upload.

### Step 1: Chunked Loading for Heterogeneous Data

The following implementation defines three distinct generators to stream IMU, GPS, and Pressure data. In this example, we assume our CSV files contain the following columns:

imu.csv

```
timestamp, acc_x, acc_y, acc_z, gyro_x, gyro_y, gyro_z
1110022, 0.0032, 0.001, -0.002, 0.01, 0.005, -0.003
```

gps.csv

```
timestamp, latitude, longitude, altitude, status, service
1110022, 45.123456, -93.123456, 250.0, 1, 1
```

pressure.csv

```
timestamp, pressure
1110022, 101325.0
```

When dealing with massive datasets spread across multiple files, we adopt a **chunked loading approach** for each sensor type.

Define the generator functions that yield `Message` objects:

```
import pandas as pd

from mosaicolabs import (
    MosaicoClient,   # The gateway to the Mosaico Platform
    OnErrorPolicy,   # The error policy for the SequenceWriter
    Message,         # The base class for all data messages
    IMU,             # The IMU sensor data class
    Vector3d,        # The 3D vector class, needed to populate the IMU and GPS data
    GPS,             # The GPS sensor data class
    GPSStatus,       # The GPS status enum, needed to populate the GPS data
    Pressure,        # The Pressure sensor data class
)

# Define the generator functions that yield `Message` objects.
# For each file, open the reading process and yield the messages one by one.
def stream_imu_from_csv(file_path: str, chunk_size: int = 1000):
    for chunk in pd.read_csv(file_path, chunksize=chunk_size):
        for row in chunk.itertuples(index=False):
            try:
                yield Message(
                    timestamp_ns=int(row.timestamp),
                    data=IMU(
                        acceleration=Vector3d(
                            x=float(row.acc_x),
                            y=float(row.acc_y),
                            z=float(row.acc_z),
                        ),
                        angular_velocity=Vector3d(
                            x=float(row.gyro_x),
                            y=float(row.gyro_y),
                            z=float(row.gyro_z),
                        ),
                    ),
                )
            except Exception:
                # Yield None only for parsing/type-related errors
                yield None

def stream_gps_from_csv(file_path: str, chunk_size: int = 1000):
    for chunk in pd.read_csv(file_path, chunksize=chunk_size):
        for row in chunk.itertuples(index=False):
            try:
                yield Message(
                    timestamp_ns=int(row.timestamp),
                    data=GPS(
                        position=Vector3d(
                            x=float(row.latitude),
                            y=float(row.longitude),
                            z=float(row.altitude),
                        ),
                        status=GPSStatus(
                            status=int(row.status),
                            service=int(row.service),
                        ),
                    ),
                )
            except Exception:
                # Yield None only for parsing/type-related errors
                yield None

def stream_pressure_from_csv(file_path: str, chunk_size: int = 1000):
    for chunk in pd.read_csv(file_path, chunksize=chunk_size):
        for row in chunk.itertuples(index=False):
            try:
                yield Message(
                    timestamp_ns=int(row.timestamp),
                    data=Pressure(value=row.pressure),
                )
            except Exception:
                # Yield None only for parsing/type-related errors
                yield None
```

#### Understanding the Output

The Mosaico `Message` object is an in-memory object that wraps the sensor data with the necessary metadata (e.g., the timestamp), ensuring it is ready for serialization and network transmission. In this specific case, the data are instances of the `IMU`, `GPS` and `Pressure` models. These are built-in parts of the Mosaico default ontology, meaning the platform already understands their schemas and how to optimize their storage.

For a more in-depth explanation, see:

* **Documentation: Data Models & Ontology**
* **API Reference: Sensor Models**

### Step 2: Orchestrating the Multi-Topic Sequence

To write data, we first establish a connection to the Mosaico server via the `MosaicoClient.connect()` method and create a `SequenceWriter`. A sequence writer acts as a logical container for related sensor data streams (topics).
When initializing your data handling pipeline, it is highly recommended to wrap the **Mosaico Client** in a `with` statement. This context manager pattern ensures that the underlying network connections and shared resource pools are correctly shut down and released when your operations conclude.

Connect to the Mosaico server and create a sequence writer:

```
with MosaicoClient.connect("localhost", 6726) as client:
    with client.sequence_create(
        sequence_name="multi_sensor_ingestion",
        metadata={"mission": "alpha_test", "environment": "laboratory"},
        on_error=OnErrorPolicy.Delete  # (1)!
    ) as swriter:
        # Steps 3 and 4 (Topic Creation & serial Pushing) happen here...
```

1. Mosaico supports two distinct error policies for sequences: `OnErrorPolicy.Delete` and `OnErrorPolicy.Report`.

**Context Management**

It is **mandatory** to use the `SequenceWriter` instance returned by `client.sequence_create()` inside its own `with` context. The following code will raise an exception:

```
swriter = client.sequence_create(
    sequence_name="multi_sensor_ingestion",
    metadata={...},
)

# Performing operations using `swriter` will raise an exception
swriter.topic_create(...)  # Raises here
```

This choice ensures that the sequence writing orchestrator is closed and cataloged when the block is exited, even if your application encounters a crash or is manually interrupted.

#### Sequence-Level Error Handling

The behavior of the orchestrator during a failure is governed by the `on_error` policy. This is a *last-resort* automated error policy, which dictates how the server manages a sequence if an unhandled exception bubbles up to the `SequenceWriter` context manager. By default, it is set to `OnErrorPolicy.Delete`, which signals the server to physically remove the incomplete sequence and its associated topic directories if any error occurred.
Alternatively, you can specify `OnErrorPolicy.Report`: in this case, the SDK will not delete the data but will instead send an error notification to the server, allowing the platform to flag the sequence as failed while retaining whatever records were successfully transmitted before the error occurred.

For a more in-depth explanation, see:

* **Documentation: The Writing Workflow**
* **API Reference: Writing Data**

### Step 3: Topic Creation and Resource Allocation

Inside the sequence, we create individual **Topic Writers** to manage the data streams. Each writer is an independent "lane" with its own internal buffer and background thread for serialization.

```
with client.sequence_create(...) as swriter:
    # Create dedicated Topic Writers for each sensor stream
    imu_twriter = swriter.topic_create(  # (1)!
        topic_name="sensors/imu",
        metadata={"sensor_id": "accel_01"},
        ontology_type=IMU,
    )
    gps_twriter = swriter.topic_create(  # (2)!
        topic_name="sensors/gps",
        metadata={"sensor_id": "gps_01"},
        ontology_type=GPS,
    )
    pressure_twriter = swriter.topic_create(  # (3)!
        topic_name="sensors/pressure",
        metadata={"sensor_id": "pressure_01"},
        ontology_type=Pressure,
    )
```

1. Here we are creating a dedicated writer for the IMU topic.
2. Here we are creating a dedicated writer for the GPS topic.
3. Here we are creating a dedicated writer for the Pressure topic.

### Step 4: Pushing Data into the Pipeline

The final stage of the ingestion process involves iterating through your data generators and transmitting records to the Mosaico platform by calling the `TopicWriter.push()` method for each record. The `push()` method optimizes throughput by accumulating messages into internal batches.

```
# 1. Push IMU Data
for msg in stream_imu_from_csv("imu.csv"):
    if msg is None:
        # Log and skip, or raise if incomplete data is disallowed
        print("Skipping row due to parsing error")
        continue  # Ignore malformed records
    try:
        imu_twriter.push(message=msg)
    except Exception as e:
        # Log and skip, or raise if incomplete data is disallowed
        print(f"Error processing IMU at time: {msg.timestamp_ns}. Inner err: {e}")

# 2. Push GPS Data with Custom Processing
for msg in stream_gps_from_csv("gps.csv"):
    if msg is None:
        # Log and skip, or raise if incomplete data is disallowed
        print("Skipping row due to parsing error")
        continue  # Ignore malformed records
    try:
        # This custom processing might fail
        process_gps_message(msg)
        gps_twriter.push(message=msg)
    except Exception as e:
        # Log and skip, or raise if incomplete data is disallowed
        print(f"Error processing GPS at time: {msg.timestamp_ns}. Inner err: {e}")

# 3. Push Pressure Data
for msg in stream_pressure_from_csv("pressure.csv"):
    if msg is None:
        # Log and skip, or raise if incomplete data is disallowed
        print("Skipping row due to parsing error")
        continue  # Ignore malformed records
    try:
        pressure_twriter.push(message=msg)
    except Exception as e:
        # Log and skip, or raise if incomplete data is disallowed
        print(f"Error processing pressure at time: {msg.timestamp_ns}. Inner err: {e}")
```

#### Topic-Level Error Management

In the code snippet above, we implement **Controlled Ingestion** by wrapping the topic-specific processing and pushing logic in a local `try-except` block. Because the `SequenceWriter` cannot natively distinguish which specific topic failed within your custom processing code (such as a coordinate transformation or a malformed CSV row), an unhandled exception will bubble up and trigger the global sequence-level error policy. To avoid this, you should catch errors locally for each topic.

Upcoming versions of the SDK will introduce native **Topic-Level Error Policies**.
This feature will allow you to define the error behavior directly when creating the topic, removing the need for boilerplate `try-except` blocks around every sensor stream.

## The full example code¶

```
"""
Import the necessary classes from the Mosaico SDK.
"""
import pandas as pd

from mosaicolabs import (
    MosaicoClient,  # The gateway to the Mosaico Platform
    OnErrorPolicy,  # The error policy for the SequenceWriter
    Message,        # The base class for all data messages
    IMU,            # The IMU sensor data class
    Vector3d,       # The 3D vector class, needed to populate the IMU and GPS data
    GPS,            # The GPS sensor data class
    GPSStatus,      # The GPS status enum, needed to populate the GPS data
    Pressure,       # The Pressure sensor data class
)

"""
Define the generator functions that yield `Message` objects.
For each file, open the reading process and yield the messages one by one.
"""

def stream_imu_from_csv(file_path: str, chunk_size: int = 1000):
    """Efficiently streams IMU data."""
    for chunk in pd.read_csv(file_path, chunksize=chunk_size):
        for row in chunk.itertuples(index=False):
            try:
                yield Message(
                    timestamp_ns=int(row.timestamp),
                    data=IMU(
                        acceleration=Vector3d(
                            x=float(row.acc_x),
                            y=float(row.acc_y),
                            z=float(row.acc_z),
                        ),
                        angular_velocity=Vector3d(
                            x=float(row.gyro_x),
                            y=float(row.gyro_y),
                            z=float(row.gyro_z),
                        ),
                    ),
                )
            except Exception:
                # Yield None only for parsing/type-related errors
                yield None

def stream_gps_from_csv(file_path: str, chunk_size: int = 1000):
    """Efficiently streams GPS data."""
    for chunk in pd.read_csv(file_path, chunksize=chunk_size):
        for row in chunk.itertuples(index=False):
            try:
                yield Message(
                    timestamp_ns=int(row.timestamp),
                    data=GPS(
                        position=Vector3d(
                            x=float(row.latitude),
                            y=float(row.longitude),
                            z=float(row.altitude),
                        ),
                        status=GPSStatus(
                            status=int(row.status),
                            service=int(row.service),
                        ),
                    ),
                )
            except Exception:
                # Yield None only for parsing/type-related errors
                yield None

def stream_pressure_from_csv(file_path: str, chunk_size: int = 1000):
    """Efficiently streams Barometric Pressure data."""
    for chunk in pd.read_csv(file_path, chunksize=chunk_size):
        for row in chunk.itertuples(index=False):
            try:
                yield Message(
                    timestamp_ns=int(row.timestamp),
                    data=Pressure(value=row.pressure),
                )
            except Exception:
                # Yield None only for parsing/type-related errors
                yield None

"""
Main ingestion orchestration
"""

def main():
    with MosaicoClient.connect("localhost", 6726) as client:
        # Initialize the Orchestrator for the entire mission
        with client.sequence_create(
            sequence_name="multi_sensor_ingestion",
            metadata={"mission": "alpha_test", "environment": "laboratory"},
            on_error=OnErrorPolicy.Delete  # Deletes the whole sequence if a fatal crash occurs
        ) as swriter:
            # Create dedicated Topic Writers for each sensor stream
            imu_twriter = swriter.topic_create(
                topic_name="sensors/imu",
                metadata={"sensor_id": "accel_01"},
                ontology_type=IMU,
            )
            gps_twriter = swriter.topic_create(
                topic_name="sensors/gps",
                metadata={"sensor_id": "gps_01"},
                ontology_type=GPS,
            )
            pressure_twriter = swriter.topic_create(
                topic_name="sensors/pressure",
                metadata={"sensor_id": "pressure_01"},
                ontology_type=Pressure,
            )

            # --- 1. Push IMU Data ---
            for msg in stream_imu_from_csv("imu.csv"):
                if msg is None:
                    # Log and skip, or raise if incomplete data is disallowed
                    print("Skipping row due to parsing error")
                    continue  # Ignore malformed records
                try:
                    imu_twriter.push(message=msg)
                except Exception as e:
                    # Log and skip, or raise if incomplete data is disallowed
                    print(f"Error processing IMU at time: {msg.timestamp_ns}. Inner err: {e}")

            # --- 2. Push GPS Data with Custom Processing ---
            for msg in stream_gps_from_csv("gps.csv"):
                if msg is None:
                    # Log and skip, or raise if incomplete data is disallowed
                    print("Skipping row due to parsing error")
                    continue  # Ignore malformed records
                try:
                    # This custom processing might fail
                    process_gps_message(msg)
                    gps_twriter.push(message=msg)
                except Exception as e:
                    # Log and skip, or raise if incomplete data is disallowed
                    print(f"Error processing GPS at time: {msg.timestamp_ns}. Inner err: {e}")

            # --- 3. Push Pressure Data ---
            for msg in stream_pressure_from_csv("pressure.csv"):
                if msg is None:
                    # Log and skip, or raise if incomplete data is disallowed
                    print("Skipping row due to parsing error")
                    continue  # Ignore malformed records
                try:
                    pressure_twriter.push(message=msg)
                except Exception as e:
                    # Log and skip, or raise if incomplete data is disallowed
                    print(f"Error processing pressure at time: {msg.timestamp_ns}. Inner err: {e}")

    # All buffers are flushed and the sequence is committed when exiting the SequenceWriter 'with' block
    print("Multi-topic ingestion completed!")
```

---

This guide demonstrates how to ingest data from multiple topics stored within a **single file container** (such as an MCAP or a specialized binary log) into the Mosaico Data Platform. Unlike serial ingestion, where files are processed one by one, interleaved ingestion handles a stream of messages from different sensors—such as IMU, GPS, and Pressure—as they appear in the source file.

For this guide, we use the **MCAP library** as an example to briefly show how to parse a high-performance robotics container and stream its contents into Mosaico.

You will learn how to:

* **Orchestrate a single sequence** for a multi-sensor stream.
* **Dynamically resolve Topic Writers** using the local SDK cache.
* **Implement a Custom Translator** to map external schemas to the Mosaico Ontology.
* **Isolate failures** to a single sensor stream using Defensive Ingestion patterns.
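Before walking through the steps, the core idea of interleaved ingestion can be reduced to a dependency-free sketch: records tagged with a topic name arrive in file order, and each one is routed to a per-topic "lane" that is created lazily on first encounter. The function and record layout below are illustrative only, not part of the Mosaico SDK:

```
from collections import defaultdict

def dispatch_interleaved(records):
    """Route an interleaved (topic, timestamp_ns, payload) stream into
    per-topic lanes, creating each lane lazily on first encounter."""
    lanes = defaultdict(list)  # topic name -> ordered (timestamp, payload) pairs
    for topic, ts, payload in records:
        lanes[topic].append((ts, payload))
    return dict(lanes)

# An interleaved stream, as it would appear in a multi-sensor log file
records = [
    ("/robot/imu", 100, {"acc_x": 0.1}),
    ("/robot/gps", 120, {"lat": 45.0}),
    ("/robot/imu", 110, {"acc_x": 0.2}),
]
lanes = dispatch_interleaved(records)
print(sorted(lanes))             # ['/robot/gps', '/robot/imu']
print(len(lanes["/robot/imu"]))  # 2
```

In the real pipeline described below, the "lane" is a `TopicWriter` rather than a list, but the dispatch logic is the same.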
### The Multi-Topic Streaming Architecture¶

In a mixed ingestion scenario, the source file provides a serialized stream of records. Each record contains a **topic name**, a **timestamp**, and a **data payload** associated with a specific **schema**.

| Topic | Schema Example (MCAP) | Mosaico Target Model |
| --- | --- | --- |
| `/robot/imu` | `sensor_msgs/msg/Imu` | `IMU` |
| `/robot/gps` | `sensor_msgs/msg/NavSatFix` | `GPS` |
| `/env/pressure` | `sensor_msgs/msg/FluidPressure` | `Pressure` |

As the reader iterates through the file, Mosaico dynamically assigns each record to its corresponding "lane" (Topic Writer).

### Step 1: Implementing the Custom Translator and Adapters¶

Because source files often use external data formats (like ROS, Protobuf, or JSON), you need a translation layer to map these raw payloads into strongly-typed Mosaico objects.

Map incoming data schemas to Mosaico Ontology models.

```
from typing import Optional, Type

from mosaicolabs import Message
from mosaicolabs.models import (IMU, GPS, Pressure, Vector3d, GPSStatus, Time, Serializable)

def custom_translator(schema_name: str, payload: dict):
    if schema_name == "sensor_msgs/msg/Imu":
        header = payload['header']
        timestamp_ns = Time(
            sec=header['stamp']['sec'],
            nanosec=header['stamp']['nanosec']
        ).to_nanoseconds()
        return Message(
            timestamp_ns=timestamp_ns,
            data=IMU(
                acceleration=Vector3d(**payload['linear_acceleration']),
                angular_velocity=Vector3d(**payload['angular_velocity'])
            )
        )
    if schema_name == "sensor_msgs/msg/NavSatFix":
        header = payload['header']
        timestamp_ns = Time(
            sec=header['stamp']['sec'],
            nanosec=header['stamp']['nanosec']
        ).to_nanoseconds()
        return Message(
            timestamp_ns=timestamp_ns,
            data=GPS(
                position=Vector3d(
                    x=payload['latitude'],
                    y=payload['longitude'],
                    z=payload['altitude']
                ),
                status=GPSStatus(
                    status=payload['status']['status'],
                    service=payload['status']['service']
                )
            )
        )
    if schema_name == "sensor_msgs/msg/FluidPressure":
        header = payload['header']
        timestamp_ns = Time(
            sec=header['stamp']['sec'],
            nanosec=header['stamp']['nanosec']
        ).to_nanoseconds()
        return Message(
            timestamp_ns=timestamp_ns,
            data=Pressure(value=payload['fluid_pressure'])
        )
    return None

def determine_mosaico_type(schema_name: str) -> Optional[Type["Serializable"]]:
    """Determine the Mosaico type of the topic based on the schema name."""
    if schema_name == "sensor_msgs/msg/Imu":
        return IMU
    elif schema_name == "sensor_msgs/msg/NavSatFix":
        return GPS
    elif schema_name == "sensor_msgs/msg/FluidPressure":
        return Pressure
    return None
```

#### Understanding the Output¶

The Mosaico `Message` object is an in-memory object wrapping the sensor data with the necessary metadata (e.g. timestamp), ensuring it is ready for serialization and network transmission. In this specific case, the data are instances of the `IMU`, `GPS` and `Pressure` models. These are built-in parts of the Mosaico default ontology, meaning the platform already understands their schema and how to optimize their storage.

For a more in-depth explanation:

* **Documentation: Data Models & Ontology**
* **API Reference: Sensor Models**

### Step 2: Orchestrating the Multi-Topic Interleaved Ingestion¶

To write data, we first establish a connection to the Mosaico server via the `MosaicoClient.connect()` method and create a `SequenceWriter`. A sequence writer acts as a logical container for related sensor data streams (topics).

When initializing your data handling pipeline, it is highly recommended to wrap the **Mosaico Client** within a `with` statement. This context manager pattern ensures that underlying network connections and shared resource pools are correctly shut down and released when your operations conclude.
Connect to the Mosaico server and create a sequence writer

```
from mcap.reader import make_reader
from mosaicolabs import MosaicoClient, OnErrorPolicy, Message

with open("mission_data.mcap", "rb") as f:
    reader = make_reader(f)
    with MosaicoClient.connect("localhost", 6726) as client:
        with client.sequence_create(
            sequence_name="multi_sensor_ingestion",
            metadata={"mission": "alpha_test", "environment": "laboratory"},
            on_error=OnErrorPolicy.Delete  # (1)!
        ) as swriter:
            # Steps 3 and 4 (Topic Creation & Pushing) happen here...
            pass
```

1. Mosaico supports two distinct error policies for sequences: `OnErrorPolicy.Delete` and `OnErrorPolicy.Report`.

Context Management

It is **mandatory** to use the `SequenceWriter` instance returned by `client.sequence_create()` inside its own `with` context. The following code will raise an exception:

```
swriter = client.sequence_create(
    sequence_name="multi_sensor_ingestion",
    metadata={...},
)

# Performing operations using `swriter` will raise an exception
swriter.topic_create(...)  # Raises here
```

This choice ensures that the sequence writing orchestrator is closed and cataloged when the block is exited, even if your application encounters a crash or is manually interrupted.

#### Sequence-Level Error Handling¶

The behavior of the orchestrator during a failure is governed by the `on_error` policy. This is a *Last-Resort* automated error policy, which dictates how the server manages a sequence if an unhandled exception bubbles up to the `SequenceWriter` context manager. By default, this is set to `OnErrorPolicy.Delete`, which signals the server to physically remove the incomplete sequence and its associated topic directories if an error occurs.
Alternatively, you can specify `OnErrorPolicy.Report`: in this case, the SDK will not delete the data but will instead send an error notification to the server, allowing the platform to flag the sequence as failed while retaining whatever records were successfully transmitted before the error occurred.

For a more in-depth explanation:

* **Documentation: The Writing Workflow**
* **API Reference: Writing Data**

### Step 3: Topic Creation and Resource Allocation¶

Inside the sequence, we can stream interleaved data without loading the entire file into memory. We automatically create an individual **Topic Writer** for each channel in the MCAP file to manage its data stream. Each writer is an independent "lane" assigned its own internal buffer and background thread for serialization.

The `swriter.get_topic_writer` pattern removes the need to pre-scan the file. **Topics are created only when they are first encountered**.

```
with client.sequence_create(...) as swriter:
    # Iterate through all interleaved messages
    for schema, channel, message in reader.iter_messages():
        # 1. Resolve Topic Writer using the SDK cache
        twriter = swriter.get_topic_writer(channel.topic)  # (1)!
        if twriter is None:
            ontology_type = determine_mosaico_type(schema.name)
            if ontology_type is None:
                print(f"Skipping message on {channel.topic} due to unknown ontology type")
                # Skip the topic if no ontology type is found
                continue
            # Dynamically register the topic writer upon discovery
            twriter = swriter.topic_create(  # (2)!
                topic_name=channel.topic,
                metadata={},
                ontology_type=ontology_type,
            )
```

1. Check whether a `TopicWriter` for the current topic already exists.
2. Create the topic writer for the current topic if it does not exist yet.

### Step 4: Pushing Data into the Pipeline¶

The final stage of the ingestion process involves iterating through your data generators and transmitting records to the Mosaico platform by calling the `TopicWriter.push()` method for each record.
The `push()` method optimizes throughput by accumulating messages into internal batches.

```
try:
    # In a real scenario, use a deserializer like mcap_ros2.decoder
    raw_data = deserialize_payload(message.data, schema.name)  # (1)!
    mosaico_msg = custom_translator(schema.name, raw_data)
    if mosaico_msg is None:
        # Log and skip, or raise if incomplete data is disallowed
        print("Skipping row due to parsing error")
        continue  # Ignore malformed records
    twriter.push(message=mosaico_msg)
except Exception as e:
    print(f"Skip error on {channel.topic} at {message.log_time}: {e}")
```

1. This is an example of a custom function that deserializes the payload of the current message.

#### Topic-Level Error Management¶

In the code snippet above, we implemented **Controlled Ingestion** by wrapping the topic-specific processing and pushing logic within a local `try-except` block. Because the `SequenceWriter` cannot natively distinguish which specific topic failed within your custom processing code (such as a coordinate transformation), an unhandled exception will bubble up and trigger the global sequence-level error policy. To avoid this, you should catch errors locally for each topic.

Upcoming versions of the SDK will introduce native **Topic-Level Error Policies**. This feature will allow you to define the error behavior directly when creating the topic, removing the need for boilerplate `try-except` blocks around every sensor stream.

## The full example code¶

```
"""
Import the necessary classes from the Mosaico SDK.
""" from mcap.reader import make_reader from mosaicolabs import ( MosaicoClient, # The gateway to the Mosaico Platform OnErrorPolicy, # The error policy for the SequenceWriter Message, # The base class for all data messages IMU, # The IMU sensor data class Vector3d, # The 3D vector class, needed to populate the IMU and GPS data GPS, # The GPS sensor data class GPSStatus, # The GPS status enum, needed to populate the GPS data Pressure, # The Pressure sensor data class ) """ Define the generator functions that yield `Message` objects. For each schema, we define a function that translates the payload of the current message into a `Message` object. """ def custom_translator(schema_name: str, payload: dict): if schema_name == "sensor_msgs/msg/Imu": header = payload['header'] timestamp_ns = Time( sec=header['stamp']['sec'], nanosec=header['stamp']['nanosec'] ).to_nanoseconds() return Message( timestamp_ns=timestamp_ns, data=IMU( acceleration=Vector3d(**payload['linear_acceleration']), angular_velocity=Vector3d(**payload['angular_velocity']) ) ) if schema_name == "sensor_msgs/msg/NavSatFix": header = payload['header'] timestamp_ns = Time( sec=header['stamp']['sec'], nanosec=header['stamp']['nanosec'] ).to_nanoseconds() return Message( timestamp_ns=timestamp_ns, data=GPS( position=Vector3d( x=payload['latitude'], y=payload['longitude'], z=payload['altitude'] ), status=GPSStatus( status=payload['status']['status'], service=payload['status']['service'] ) ) ) if schema_name == "sensor_msgs/msg/FluidPressure": header = payload['header'] timestamp_ns = Time( sec=header['stamp']['sec'], nanosec=header['stamp']['nanosec'] ).to_nanoseconds() return Message( timestamp_ns=timestamp_ns, data=Pressure(value=payload['fluid_pressure']) ) return None def determine_mosaico_type(schema_name: str) -> Optional[Type["Serializable"]]: """Determine the Mosaico type of the topic based on the schema name.""" if schema_name == "sensor_msgs/msg/Imu": return IMU elif schema_name == 
"sensor_msgs/msg/NavSatFix": return GPS elif schema_name == "sensor_msgs/msg/FluidPressure": return Pressure return None """ Main ingestion orchestration """ def main(): with open("mission_data.mcap", "rb") as f: reader = make_reader(f) with MosaicoClient.connect("localhost", 6726) as client: with client.sequence_create( sequence_name="multi_sensor_ingestion", metadata={"mission": "alpha_test", "environment": "laboratory"}, on_error=OnErrorPolicy.Delete ) as swriter: # Iterate through all interleaved messages for schema, channel, message in reader.iter_messages(): # 1. Resolve Topic Writer using the SDK cache twriter = swriter.get_topic_writer(channel.topic) if twriter is None: ontology_type = determine_mosaico_type(schema.name) if ontology_type is None: print(f"Skipping message on {channel.topic} due to unknown ontology type") # Skip the topic if no ontology type is found continue # Dynamically register the topic writer upon discovery twriter = swriter.topic_create( topic_name=channel.topic, metadata={}, ontology_type=ontology_type ) # 2. Defensive Ingestion: Isolate errors to this specific record try: # In a real scenario, use a deserializer like mcap_ros2.decoder raw_data = deserialize_payload(message.data, schema.name) # Example helper function mosaico_msg = custom_translator(schema.name, raw_data) if mosaico_msg is None: # Log and skip, or raise if incomplete data is disallowed print("Skipping row due to parsing error") continue # Ignore malformed records twriter.push(message=mosaico_msg) except Exception as e: print(f"Skip error on {channel.topic} at {message.log_time}: {e}") # All buffers are flushed and the sequence is committed when exiting the SequenceWriter 'with' block print("Multi-topic ingestion completed!") ``` --- This guide demonstrates how to interact with the Mosaico Data Platform to inspect and retrieve data that has been previously ingested. You will learn how to use the Mosaico SDK to: * **Connect to the catalog** to find existing recordings. 
* **Inspect sequence metadata** and temporal bounds.
* **Access specific topic handlers** to analyze individual sensor streams.

For a more in-depth explanation:

* **Documentation: The Reading Workflow**
* **API Reference: Data Retrieval**

### Step 1: Connecting to the Catalog¶

To begin inspecting data, you must establish a connection via the `MosaicoClient`. Reading is managed through a context manager to ensure all network resources are cleanly released.

```
from mosaicolabs import MosaicoClient

# Establish a secure connection to the Mosaico server
with MosaicoClient.connect("localhost", 6726) as client:
    # Use a Handler to inspect the catalog for a specific recording session
    seq_handler = client.sequence_handler("multi_sensor_ingestion")
    if not seq_handler:
        print("Sequence not found in the catalog.")
    else:
        # Proceed to inspect metadata (Step 2)
        pass
```

### Step 2: Inspecting Sequence Metadata¶

A `SequenceHandler` provides a view of a complete recording session without transferring the actual bulk data yet. This "lazy" inspection allows you to verify session parameters, such as the total size on disk and global user metadata.
``` """Inside the `if seq_handler:` block""" # Print sequence metadata print(f"Sequence: {seq_handler.name}") print(f"• Registered Topics: {seq_handler.topics}") print(f"• User Metadata: {seq_handler.user_metadata}") # Analyze temporal bounds (earliest and latest timestamps across all sensors) # Timestamps are consistently handled in nanoseconds start, end = seq_handler.timestamp_ns_min, seq_handler.timestamp_ns_max print(f"• Duration (ns): {end - start}") # Access structural info from the server size_mb = seq_handler.sequence_info.total_size_bytes / (1024 * 1024) print(f"• Total Size: {size_mb:.2f} MB") print(f"• Created At: {seq_handler.sequence_info.created_datetime}") ``` ### Step 3: Accessing Individual Topics¶ While a sequence represents a "mission," a `TopicHandler` represents a specific data channel within that mission (e.g., a single IMU or GPS). ``` """Inside the `if seq_handler:` block""" # Retrieve a specific handler for the IMU sensor imu_handler = seq_handler.get_topic_handler("sensors/imu") if imu_handler: print(f"Inspecting Topic: {imu_handler.name}") print(f"• Sensor Metadata: {imu_handler.user_metadata}") # Check topic-specific temporal bounds print(f"• Topic Span: {imu_handler.timestamp_ns_min} to {imu_handler.timestamp_ns_max}") # Topic-specific size on the server topic_mb = imu_handler.topic_info.total_size_bytes / (1024 * 1024) print(f"• Topic Size: {topic_mb:.2f} MB") ``` ### Comparison: Sequence vs. 
Topic Handlers¶ | Feature | Sequence Handler | Topic Handler | | --- | --- | --- | | **Scope** | Entire Recording Session | Single Sensor Stream | | **Metadata** | Mission-wide (e.g., driver, weather) | Sensor-specific (e.g., model, serial) | | **Time Bounds** | Global min/max of all topics | Min/max for that specific stream | | **Topics** | List of all available streams | N/A | ## The full example code¶ ``` from mosaicolabs import MosaicoClient # Establish a secure connection to the Mosaico server with MosaicoClient.connect("localhost", 6726) as client: # Use a Handler to inspect the catalog for a specific recording session seq_handler = client.sequence_handler("multi_sensor_ingestion") if not seq_handler: print("Sequence not found in the catalog.") else: # Proceed to inspect metadata (Step 2) pass # Print sequence metadata print(f"Sequence: {seq_handler.name}") print(f"• Registered Topics: {seq_handler.topics}") print(f"• User Metadata: {seq_handler.user_metadata}") # Analyze temporal bounds (earliest and latest timestamps across all sensors) # Timestamps are consistently handled in nanoseconds start, end = seq_handler.timestamp_ns_min, seq_handler.timestamp_ns_max print(f"• Duration (ns): {end - start}") # Access structural info from the server size_mb = seq_handler.sequence_info.total_size_bytes / (1024 * 1024) print(f"• Total Size: {size_mb:.2f} MB") print(f"• Created At: {seq_handler.sequence_info.created_datetime}") # Retrieve a specific handler for the IMU sensor imu_handler = seq_handler.get_topic_handler("sensors/imu") if imu_handler: print(f"Inspecting Topic: {imu_handler.name}") print(f"• Sensor Metadata: {imu_handler.user_metadata}") # Check topic-specific temporal bounds print(f"• Topic Span: {imu_handler.timestamp_ns_min} to {imu_handler.timestamp_ns_max}") # Topic-specific size on the server topic_mb = imu_handler.topic_info.total_size_bytes / (1024 * 1024) print(f"• Topic Size: {topic_mb:.2f} MB") ``` --- Prerequisites To fully grasp the following 
How-To, we recommend you to read the **Reading a Sequence and its Topics How-To**. This guide demonstrates how to interact with the Mosaico Data Platform to retrieve the data stream that has been previously ingested. You will learn how to use the Mosaico SDK to: * **Obtain a `SequenceDataStreamer`** to consume recordings from a sequence. * **Obtain a `TopicDataStreamer`** to consume recordings from a topic. For a more in-depth explanation: * **Documentation: The Reading Workflow** * **API Reference: Data Retrieval** ### Unified Multi-Sensor Replay¶ A `SequenceDataStreamer` is designed for sensor fusion and full-system replay. It allows you to consume synchronized multiple data streams—such as high-rate IMU data and low-rate GPS fixes—as if they were a single, coherent timeline. ``` from mosaicolabs import MosaicoClient with MosaicoClient.connect("localhost", 6726) as client: seq_handler = client.sequence_handler("mission_alpha") if seq_handler: # Initialize a Unified Stream for synchronized multi-sensor analysis streamer = seq_handler.get_data_streamer( # Filter specific topics topics=["/gps", "/imu"], # Define the optional temporal window: Only data in this range will be streamed start_timestamp_ns=1738508778000000000, end_timestamp_ns=1738509618000000000, ) print(f"Streaming starts at: {streamer.next_timestamp()}") # Consume the stream. The loop yields messages from both topics in perfect chronological order for topic, msg in streamer: print(f"[{topic}] at {msg.timestamp_ns}: {type(msg.data).__name__}") # Finalize the reading channel to release server resources seq_handler.close() ``` ### Targeted Access¶ A `TopicDataStreamer` provides a dedicated channel for interacting with a single data resource. 
```
from mosaicolabs import MosaicoClient, IMU

with MosaicoClient.connect("localhost", 6726) as client:
    # Access a specific topic handler directly via the client
    top_handler = client.topic_handler("mission_alpha", "/front/imu")
    if top_handler:
        # Start a Targeted Stream for isolated, low-overhead replay
        imu_stream = top_handler.get_data_streamer(
            # Define the optional temporal window: Only data in this range will be streamed
            start_timestamp_ns=1738508778000000000,
            end_timestamp_ns=1738509618000000000,
        )

        # Query the next timestamp, without consuming the message
        print(f"First sample at: {imu_stream.next_timestamp()}")

        # Direct loop for maximum efficiency
        for imu_msg in imu_stream:
            # Access the strongly-typed IMU data directly
            process_sample(imu_msg.get_data(IMU))

        # Finalize the topic channel
        top_handler.close()
```

### Streamer Comparison¶

| Feature | `SequenceDataStreamer` | `TopicDataStreamer` |
| --- | --- | --- |
| **Primary Use Case** | Multi-sensor fusion & system-wide replay | Isolated sensor analysis & ML training |
| **Logic Overhead** | K-Way Merge Sorting | Direct Stream |
| **Output Type** | Tuple of `(topic_name, message)` | Single `message` object |
| **Temporal Slicing** | Supported | Supported |

---

This guide demonstrates how to locate specific recording sessions based on their naming conventions and custom user metadata tags. This is the most common entry point for data discovery, allowing you to isolate sessions that match specific environmental or project conditions.

### The Objective¶

We want to find all sequences where:

1. The sequence name contains the string `"test_drive"`.
2. The user metadata indicates a specific project name (e.g., `"Apollo"`).
3. The environmental visibility was recorded as less than 50m.
For a more in-depth explanation:

* **Documentation: Querying Catalogs**
* **API Reference: Query Builders**
* **API Reference: Query Response**

### Implementation¶

When you call multiple `with_*` methods of the `QuerySequence` builder, the platform joins them with a logical **AND** condition. The server returns only the sequences that satisfy all of the criteria.

```
from mosaicolabs import MosaicoClient, QuerySequence, Sequence

# 1. Establish a connection
with MosaicoClient.connect("localhost", 6726) as client:
    # 2. Execute the query
    results = client.query(
        QuerySequence()
        # Use a convenience method for fuzzy name matching
        .with_name_match("test_drive")
        # Use the .Q proxy to filter fixed and dynamic metadata fields
        .with_expression(Sequence.Q.user_metadata["project"].eq("Apollo"))
        .with_expression(Sequence.Q.user_metadata["environment.visibility"].lt(50))  # (1)!
    )

    # 3. Process the Response
    if results:
        for item in results:
            # item.sequence contains the information for the matched sequence
            print(f"Matched Sequence: {item.sequence.name}")
            print(f"  Topics: {[topic.name for topic in item.topics]}")  # (2)!
```

1. Use dot notation to access nested fields in the `user_metadata` dictionary.
2. The `item.topics` list contains all the topics that matched the query. In this case, all the available topics are returned because no topic-specific filters were applied.

The `query` method returns `None` if an error occurs, or a `QueryResponse` object. This response acts as a list of `QueryResponseItem` objects, each providing:

* **`item.sequence`**: A `QueryResponseItemSequence` containing the sequence metadata.
* **`item.topics`**: A list of `QueryResponseItemTopic` objects that matched the query.

Result Normalization

The `topic.name` returns the relative topic path (e.g., `/front/camera/image`), which is immediately compatible with other SDK methods like `MosaicoClient.topic_handler()`, `SequenceHandler.get_topic_handler()` or streamers.
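The dotted key used above (`environment.visibility`) addresses a nested field inside the metadata dictionary, and chaining `with_*` calls narrows results with a logical AND. The semantics can be sketched in plain Python; this is conceptual only, not SDK code, and the metadata values are made up:

```
from functools import reduce

def get_dotted(metadata: dict, path: str):
    """Resolve a dotted path such as 'environment.visibility' in nested dicts."""
    return reduce(lambda node, key: node[key], path.split("."), metadata)

meta = {"project": "Apollo", "environment": {"visibility": 30}}

# Each with_* filter contributes one predicate; a sequence matches
# only if every predicate holds (logical AND).
predicates = [
    meta["project"] == "Apollo",
    get_dotted(meta, "environment.visibility") < 50,
]
print(all(predicates))  # True
```

On the platform, this evaluation happens server-side over the catalog, so only matching sequences are returned to the client.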
### Key Concepts¶

* **Convenience Methods**: High-level helpers like `with_name_match()` provide a quick way to filter common fields.
* **Generic Methods**: The `with_expression()` method accepts raw **Query Expressions** generated through the `.Q` proxy. This provides full access to every supported operator (`.gt()`, `.lt()`, `.between()`, etc.) for specific fields.
* **Dynamic Metadata Access**: Using the bracket notation `Sequence.Q.user_metadata["key"]` allows you to query any custom tag you attached during the ingestion phase.

---

This guide demonstrates how to locate specific topics based on their ontology type and custom user metadata tags. This allows you to isolate individual sensor streams that match specific hardware or configuration conditions.

### The Objective¶

We want to find all topics where:

1. The topic refers to an IMU sensor.
2. The user metadata indicates a specific sensor interface (e.g., `"serial"`).

For a more in-depth explanation:

* **Documentation: Querying Catalogs**
* **API Reference: Query Builders**
* **API Reference: Query Response**

### Implementation¶

When you call multiple `with_*` methods of the `QueryTopic` builder, the platform joins them with a logical **AND** condition. The server returns only the topics that satisfy all of the criteria.

```
from mosaicolabs import MosaicoClient, QueryTopic, Topic, IMU

# 1. Establish a connection
with MosaicoClient.connect("localhost", 6726) as client:
    # 2. Execute the query
    results = client.query(
        QueryTopic()
        # Use a convenience method to filter by ontology tag
        .with_ontology_tag(IMU.ontology_tag())
        # Use the .Q proxy to filter fixed and dynamic metadata fields
        .with_expression(Topic.Q.user_metadata["interface"].eq("serial"))
    )

    # 3. Process the Response
    if results:
        for item in results:
            # item.sequence contains the metadata for the matched sequence
            print(f"Matched Sequence: {item.sequence.name}")
            print(f"  Topics: {[topic.name for topic in item.topics]}")  # (1)!
```

1. The `item.topics` list contains all the topics that matched the query. In this case, it will contain all the topics that are of type IMU and have the user metadata field `interface` set to `"serial"`.

The `query` method returns `None` if an error occurs, or a `QueryResponse` object. This response acts as a list of `QueryResponseItem` objects, each providing:

* **`item.sequence`**: A `QueryResponseItemSequence` containing the sequence metadata.
* **`item.topics`**: A list of `QueryResponseItemTopic` objects that matched the query.

Result Normalization

The `topic.name` returns the relative topic path (e.g., `/front/camera/image`), which is immediately compatible with other SDK methods like `MosaicoClient.topic_handler()`, `SequenceHandler.get_topic_handler()` or streamers.

### Key Concepts¶

* **Convenience Methods**: High-level helpers like `with_ontology_tag()` provide a quick way to filter by ontology tags.
* **Generic Methods**: The `with_expression()` method accepts raw **Query Expressions** generated through the `.Q` proxy. This provides full access to every supported operator (`.gt()`, `.lt()`, `.between()`, etc.) for specific fields.
* **Dynamic Metadata Access**: Using the bracket notation `Topic.Q.user_metadata["key"]` allows you to query any custom tag you attached during the ingestion phase.

---

Beyond metadata, the Mosaico platform allows for deep inspection of actual sensor payloads. This guide shows how to search for specific physical events, such as high-impact accelerations, across your entire data catalog.

### The Objective¶

Identify specific time segments where an IMU sensor recorded:

1. Lateral acceleration (y-axis) greater than 4.0 m/s².
2. Longitudinal acceleration (x-axis) greater than 5.0 m/s².
For a more in-depth explanation:

* **Documentation: Querying Catalogs**
* **API Reference: Query Builders**
* **API Reference: Query Response**

### Implementation¶

When you call multiple `with_*` methods of the `QueryOntologyCatalog` builder, the platform joins them with a logical **AND** condition. The server returns only the sequences that satisfy all of the criteria.

```
from mosaicolabs import MosaicoClient, QueryOntologyCatalog, IMU

# 1. Establish a connection
with MosaicoClient.connect("localhost", 6726) as client:
    # 2. Execute the query across all available data
    # include_timestamp_range is essential for pinpointing the exact time of the event
    results = client.query(
        QueryOntologyCatalog(include_timestamp_range=True)
        .with_expression(IMU.Q.acceleration.x.gt(5.0))
        .with_expression(IMU.Q.acceleration.y.gt(4.0))
    )

    # 3. Process the Response
    if results:
        for item in results:
            print(f"Impact detected in Sequence: {item.sequence.name}")
            for topic in item.topics:
                # topic.timestamp_range provides the start and end of the match
                print(f"  - Match in Topic: {topic.name}")  # (1)!
                start, end = topic.timestamp_range.start, topic.timestamp_range.end  # (2)!
                print(f"    Event Window: {start} to {end} ns")
```

1. The `item.topics` list contains all the topics that matched the query. In this case, it will contain all the topics that are of type IMU and for which the data-related filter is met.
2. The `topic.timestamp_range` provides the first and last occurrence of the queried condition within a topic, allowing you to slice data accurately for further analysis.

The `query` method returns `None` if an error occurs, or a `QueryResponse` object. This response acts as a list of `QueryResponseItem` objects, each providing:

* **`item.sequence`**: A `QueryResponseItemSequence` containing the sequence metadata.
* **`item.topics`**: A list of `QueryResponseItemTopic` objects that matched the query.
Result Normalization

The `topic.name` returns the relative topic path (e.g., `/front/camera/image`), which is immediately compatible with other SDK methods like `MosaicoClient.topic_handler()`, `SequenceHandler.get_topic_handler()` or streamers.

### Key Concepts¶

* **Generic Methods**: The `with_expression()` method accepts raw **Query Expressions** generated through the `.Q` proxy. This provides full access to every supported operator (`.gt()`, `.lt()`, `.between()`, etc.) for specific fields.
* **The `.Q` Proxy**: Every ontology model features a static `.Q` attribute that dynamically builds type-safe field paths for your expressions.
* **Temporal Windows**: Setting `include_timestamp_range=True` enables the platform to return the precise "occurrence" of the event, which is vital for later playback or slicing.
* **Type-Safe Operators**: The `.Q` proxy ensures that only valid operators (like `.gt()`) are available for numeric fields like `acceleration.x`.

---

This guide demonstrates how to orchestrate a **Unified Query** across three distinct layers of the Mosaico Data Platform: the **Sequence** (session metadata), the **Topic** (channel configuration), and the **Ontology Catalog** (actual sensor data). By combining these builders in a single request, you can perform highly targeted searches that correlate mission-level context with specific sensor events.

### The Objective¶

We want to isolate data segments from a large fleet recording by searching for:

1. **Sequence**: Sessions belonging to the `"Apollo"` project.
2. **Topic**: Specifically the IMU mounted alongside the front-facing camera, named `"/front/camera/imu"`.
3. **Ontology**: Time segments where this IMU recorded a longitudinal acceleration (x-axis) exceeding 5.0 m/s².
For a more in-depth explanation:

* **Documentation: Querying Catalogs**
* **API Reference: Query Builders**
* **API Reference: Query Response**

### Implementation¶

When you pass multiple builders to the `MosaicoClient.query()` method, the platform joins them with a logical **AND** condition. The server will return only the sequences that match the `QuerySequence` criteria, and within those sequences, only the topics that match both the `QueryTopic` and `QueryOntologyCatalog` criteria.

The multi-domain query allows you to execute a search across metadata and raw sensor data in a single, atomic request.

```
from mosaicolabs import MosaicoClient, QuerySequence, QueryTopic, QueryOntologyCatalog, IMU, Sequence

# 1. Establish a connection
with MosaicoClient.connect("localhost", 6726) as client:
    # 2. Execute a unified multi-domain query
    results = client.query(
        # Filter 1: Sequence Layer (Project Metadata)
        QuerySequence()
        .with_expression(Sequence.Q.user_metadata["project.name"].eq("Apollo")),  # (1)!
        # Filter 2: Topic Layer (Specific Channel Name)
        QueryTopic()
        .with_name("/front/camera/imu"),  # Precise name match
        # Filter 3: Ontology Layer (Deep Data Event Detection)
        QueryOntologyCatalog(include_timestamp_range=True)
        .with_expression(IMU.Q.acceleration.x.gt(5.0))
    )

    # 3. Process the Structured Response
    if results:
        for item in results:
            # item.sequence contains the matched Sequence metadata
            print(f"Sequence: {item.sequence.name}")
            # item.topics contains only the topics and time-segments
            # that satisfied ALL criteria simultaneously
            for topic in item.topics:
                # Access the high-precision timestamp for the detected event
                print(f" - Match in Topic: {topic.name}")  # (2)!
                start, end = topic.timestamp_range.start, topic.timestamp_range.end  # (3)!
                print(f"   Event Window: {start} to {end} ns")
```

1. Use bracket notation with a dot-separated key to access nested fields in the `user_metadata` dictionary.
2. The `item.topics` list contains all the topics that matched the query.
In this case, it will contain all the topics that are of type IMU, with a name matching that specific topic name and for which the data-related filter is met.
3. The `topic.timestamp_range` provides the first and last occurrence of the queried condition within a topic, allowing you to slice data accurately for further analysis.

The `query` method returns `None` if an error occurs, or a `QueryResponse` object. This response acts as a list of `QueryResponseItem` objects, each providing:

* **`item.sequence`**: A `QueryResponseItemSequence` containing the sequence metadata.
* **`item.topics`**: A list of `QueryResponseItemTopic` objects that matched the query.

**Result Normalization**

The `topic.name` returns the relative topic path (e.g., `/front/camera/image`), which is immediately compatible with other SDK methods like `MosaicoClient.topic_handler()`, `SequenceHandler.topic_handler()` or streamers.

### Key Concepts¶

* **Convenience Methods**: High-level helpers like `with_name()` provide a quick way to filter common fields.
* **Generic Methods**: The `with_expression()` method accepts raw **Query Expressions** generated through the `.Q` proxy. This provides full access to every supported operator (`.gt()`, `.lt()`, `.between()`, etc.) for specific fields.
* **The `.Q` Proxy**: Every ontology model features a static `.Q` attribute that dynamically builds type-safe field paths for your expressions.
* **Temporal Windows**: By setting `include_timestamp_range=True` in the `QueryOntologyCatalog`, the platform identifies the exact start and end of the matching event within the stream.
* **Dictionary Access**: Use bracket notation (e.g., `Sequence.Q.user_metadata["key.subkey"]`) to query custom tags that are not part of a fixed schema.

---

Sometimes a single query is insufficient because you need to correlate data across different topics.
This guide demonstrates **Query Chaining**, a technique where the results of one search are used to "lock" the domain for a second, more specific search.

### The Objective¶

Find sequences where a high-precision GPS state was achieved, and **within those same sequences**, locate any log messages containing the string `"[ERR]"`.

For a more in-depth explanation:

* **Documentation: Querying Catalogs**
* **API Reference: Query Builders**
* **API Reference: Query Response**

### Implementation¶

```
from mosaicolabs import MosaicoClient, QueryTopic, QueryOntologyCatalog, GPS, String

with MosaicoClient.connect("localhost", 6726) as client:
    # Step 1: Initial Broad Search
    # Find all sequences with high-precision GPS (e.g. status code 2)
    initial_response = client.query(
        QueryOntologyCatalog()
        .with_expression(GPS.Q.status.status.eq(2))
    )

    if initial_response:
        # Step 2: Domain Locking
        # .to_query_sequence() creates a new builder pre-filtered to ONLY these sequences
        refined_domain = initial_response.to_query_sequence()  # (1)!

        # Step 3: Targeted Refinement
        # Search for error strings only within the validated sequences
        final_results = client.query(
            refined_domain,  # Restrict to this search domain
            QueryTopic().with_name("/localization/log_string"),
            QueryOntologyCatalog()
            .with_expression(String.Q.data.match("[ERR]"))
        )

        for item in final_results:
            print(f"Error found in precise sequence: {item.sequence.name}")
```

1. `to_query_sequence()` returns a `QuerySequence` builder pre-filtered to include only the **sequences** present in the response. See also `to_query_topic()`.

The `query` method returns `None` if an error occurs, or a `QueryResponse` object. This response acts as a list of `QueryResponseItem` objects, each providing:

* **`item.sequence`**: A `QueryResponseItemSequence` containing the sequence metadata.
* **`item.topics`**: A list of `QueryResponseItemTopic` objects that matched the query.
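Conceptually, chaining narrows the search domain to the intersection of two sequence sets: the first query yields the sequences with high-precision GPS, and the second query is evaluated only inside that set. A minimal, stdlib-only sketch of the idea (the sequence names are made up for illustration):

```python
# Conceptual sketch: chaining restricts the second query to sequences
# matched by the first one. Sequence names here are hypothetical.
gps_matches = {"run_01", "run_04", "run_09"}    # sequences with high-precision GPS
error_matches = {"run_04", "run_07", "run_09"}  # sequences containing "[ERR]" log lines

# to_query_sequence() conceptually restricts the second search to the first result set:
chained = gps_matches & error_matches
print(sorted(chained))  # ['run_04', 'run_09']
```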
**Result Normalization**

The `topic.name` returns the relative topic path (e.g., `/front/camera/image`), which is immediately compatible with other SDK methods like `MosaicoClient.topic_handler()`, `SequenceHandler.topic_handler()` or streamers.

### Key Concepts¶

* **Convenience Methods**: High-level helpers like `QueryTopic().with_name()` provide a quick way to filter common fields such as the topic name.
* **Generic Methods**: The `with_expression()` method accepts raw **Query Expressions** generated through the `.Q` proxy. This provides full access to every supported operator (`.gt()`, `.lt()`, `.between()`, etc.) for specific fields.
* **The `.Q` Proxy**: Every ontology model features a static `.Q` attribute that dynamically builds type-safe field paths for your expressions.
* **Why is Chaining Necessary?** A single `client.query()` call applies a logical **AND** to all conditions to find a single **topic** that satisfies everything. Since a topic cannot be both a `GPS` stream and a `String` log simultaneously, you must use chaining to link two different topics within the same **Sequence** context.

---

This guide walks you through the process of extending the Mosaico Data Platform with custom data models. While Mosaico provides a rich default ontology for robotics (IMU, GPS, Images, etc.), specialized hardware often requires proprietary data structures.

By the end of this guide, you will be able to:

* **Define** strongly-typed data models using Python and Apache Arrow.
* **Register** these models so they are recognized by the Mosaico Ecosystem.
* **Integrate** them into the ingestion and retrieval pipelines.

For a more in-depth explanation:

* **Documentation: Data Models & Ontology**
* **API Reference: Base Models and Mixins**

### Step 1: Define the Custom Data Model¶

In Mosaico, data models are defined by inheriting from the **`Serializable`** base class. This ensures that your model can be automatically translated into the platform's high-performance storage format.
For this example, we will create a model for **`EncoderTicks`**, found in the NVIDIA Isaac-related datasets.

```
import pyarrow as pa
from mosaicolabs import HeaderMixin, Serializable

class EncoderTicks(
    Serializable,  # Automatically registers the model via `Serializable.__init_subclass__`
    HeaderMixin,   # Injects standard metadata (timestamp, frame_id, seq)
):
    """
    Custom model for hardware-level encoder tick readings.
    """

    # --- Wire Schema Definition (Apache Arrow) ---
    # This defines the high-performance binary storage format on the server.
    __msco_pyarrow_struct__ = pa.struct([
        pa.field("left_ticks", pa.uint32(), nullable=False),
        pa.field("right_ticks", pa.uint32(), nullable=False),
        pa.field("encoder_timestamp", pa.uint64(), nullable=False),
    ])

    # --- Data Fields ---
    # Names and types must strictly match the Apache Arrow schema above.
    left_ticks: int
    right_ticks: int
    encoder_timestamp: int
```

### Step 2: Ensure "Discovery" via Module Import¶

It is a common pitfall to define a class and expect the platform to "see" it immediately. Mosaico utilizes the `Serializable.__init_subclass__` hook to perform **automatic registration** the moment the class is loaded into memory by the Python interpreter.

For your custom type to be available in your application (especially during ingestion or when using the `ROSBridge`), you **must** ensure the module containing the class is imported.

#### Best Practice: The Registry Pattern¶

Create a dedicated `models.py` or `ontology/` package for your project and import it at your application's entry point.

```
# app/main.py
import my_project.ontology.encoders as encoders  # <-- This triggers the registration
from mosaicolabs import MosaicoClient

def run_ingestion():
    with MosaicoClient.connect(...) as client:
        # Now 'EncoderTicks' is a valid ontology_type for topic creation
        with client.sequence_create(name="test") as sw:
            tw = sw.topic_create("ticks", ontology_type=encoders.EncoderTicks)
            # ...
```

### Step 3: Verifying Registration¶

If you are unsure whether your model has been correctly "seen" by the ecosystem, you can check the internal registry of the `Serializable` class.

```
from mosaicolabs import Serializable
import my_project.ontology.encoders as encoders  # <-- This triggers the registration

if encoders.EncoderTicks.is_registered():
    print("Registration successful!")
```

---

**Full Example Code**

The full example code is available under `mosaico-sdk-py/src/examples/ros_injection/main.py`.

**Prerequisites**

To fully grasp the following How-To, we recommend reading the **Customizing the Data Ontology How-To**.

**Dataset**

This tutorial uses the `r2b_whitetunnel_0` sequence from the NVIDIA R2B Dataset 2024.

This example provides a detailed, step-by-step walkthrough of a complete Mosaico data pipeline, from raw ROS bag ingestion to custom ontology creation and verification. It demonstrates how to bridge the gap between **Robot Operating System (ROS)** data and the **Mosaico Data Platform**.

By following this pipeline, you will learn how to:

1. **Create a custom ontology data model** that matches a specific hardware sensor.
2. **Implement a ROS Adapter** that converts raw ROS dictionaries into your custom Mosaico model.
3. **Automate Ingestion** using a high-performance injector to upload a complete recording (MCAP) to the server.
4. **Verify Results** by inspecting the ingested data to ensure structural integrity.

## Running the Example¶

This setup provides a local Mosaico server instance to receive and store the data from your Python scripts. This example expects the Python SDK to be installed via Poetry, as described in the **Installation section**.

#### Start the Mosaico Infrastructure¶

First, launch the required backend services (database and ingestion server) using Docker Compose.
Run these commands from the `mosaico` root directory:

```
# Navigate to the quickstart environment
cd docker/quick_start

# Start the Mosaico server and its dependencies
docker compose up
```

#### Execute the ROS Injection Script¶

Once the infrastructure is healthy, open a new terminal tab or window to run the demonstration script. Run these commands from the `mosaico-sdk-py` root directory:

```
# Navigate to the examples directory
cd src/examples

# Run the ROS injection example using poetry
poetry run python -m ros_injection.main
```

### What to Expect¶

* **Server Logs**: In your first terminal, you will see the Docker containers spinning up and the Mosaico Ingestion Server acknowledging incoming connections.
* **Injection Progress**: In your second terminal, the `RosbagInjector` will provide a CLI progress bar showing the topics being resolved, messages being adapted, and the final transmission status.
* **Data Verification**: After completion, the sequence will be fully cataloged on the server and ready for retrieval via the `SequenceHandler`.

You should see output similar to the following:

```
Downloading: https://api.ngc.nvidia.com/v2/resources/org/nvidia/team/isaac/r2bdataset2024/1/files?redirect=true&path=r2b_whitetunnel/r2b_whitetunnel_0.mcap
Fetching r2b_whitetunnel_0.mcap ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 527.0/527.0 MB XY.Z MB/s 0:00:00
╭─────────────────────────────────────────────────────────────────────╮
│ Phase 2: Starting ROS Ingestion                                     │
╰─────────────────────────────────────────────────────────────────────╯
[14:29:38] INFO mosaicolabs: SDK Logging initialized at level: INFO logging_config.py:99
           INFO mosaicolabs.ros_bridge.injector: Connecting to Mosaico at 'localhost:6276'...
injector.py:266
[14:29:40] INFO mosaicolabs.ros_bridge.injector: Opening bag: '/tmp/mosaico_assets/r2b_whitetunnel_0.mcap' injector.py:274
           INFO mosaicolabs.ros_bridge.injector: Starting upload... injector.py:291
/back_stereo_camera/left/camera_info        ━━━━━━━━━━━━━━━━━━━━━━━━━━━━  564/564   100.0% • 0:00:00 • 0:00:24
/back_stereo_camera/left/image_compressed   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━  525/525   100.0% • 0:00:00 • 0:00:24
/back_stereo_camera/right/camera_info       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━  564/564   100.0% • 0:00:00 • 0:00:24
/back_stereo_camera/right/image_compressed  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━  490/490   100.0% • 0:00:00 • 0:00:24
/chassis/battery_state                      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━  27/27     100.0% • 0:00:00 • 0:00:19
/chassis/imu                                ━━━━━━━━━━━━━━━━━━━━━━━━━━━━  1080/1080 100.0% • 0:00:00 • 0:00:24
/chassis/odom                               ━━━━━━━━━━━━━━━━━━━━━━━━━━━━  1080/1080 100.0% • 0:00:00 • 0:00:24
/chassis/ticks                              ━━━━━━━━━━━━━━━━━━━━━━━━━━━━  1081/1081 100.0% • 0:00:00 • 0:00:24
/front_stereo_camera/left/camera_info       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━  562/562   100.0% • 0:00:00 • 0:00:24
/front_stereo_camera/left/image_compressed  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━  521/521   100.0% • 0:00:00 • 0:00:24
/front_stereo_camera/right/camera_info      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━  562/562   100.0% • 0:00:00 • 0:00:24
/front_stereo_camera/right/image_compressed ━━━━━━━━━━━━━━━━━━━━━━━━━━━━  515/515   100.0% • 0:00:00 • 0:00:24
/front_stereo_imu/imu                       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━  1761/1761 100.0% • 0:00:00 • 0:00:24
/left_stereo_camera/left/camera_info        ━━━━━━━━━━━━━━━━━━━━━━━━━━━━  527/527   100.0% • 0:00:00 • 0:00:24
/left_stereo_camera/left/image_compressed   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━  513/513   100.0% • 0:00:00 • 0:00:24
/left_stereo_camera/right/camera_info       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━  527/527   100.0% • 0:00:00 • 0:00:24
/left_stereo_camera/right/image_compressed  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━  498/498   100.0% • 0:00:00 • 0:00:24
/right_stereo_camera/left/camera_info       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━  488/488   100.0% • 0:00:00 • 0:00:24
/right_stereo_camera/left/image_compressed  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━  488/488     100.0% • 0:00:00 • 0:00:24
/right_stereo_camera/right/camera_info      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━  488/488     100.0% • 0:00:00 • 0:00:24
/right_stereo_camera/right/image_compressed ━━━━━━━━━━━━━━━━━━━━━━━━━━━━  488/488     100.0% • 0:00:00 • 0:00:24
/tf                                         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━  4282/4282   100.0% • 0:00:00 • 0:00:24
/tf_static                                  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━  1/1         100.0% • 0:00:00 • 0:00:00
Total Upload                                ━━━━━━━━━━━━━━━━━━━━━━━━━━━━  17632/17632 100.0% • 0:00:00 • 0:00:24
...Other logging messages
```

## Step-by-Step Guide¶

### Step 1: Custom Ontology Definition (`isaac.py`)¶

In Mosaico, data is strongly typed. When dealing with specialized hardware like the NVIDIA Isaac Nova encoders, whose custom data models are not available in the SDK, we must define a model that the platform understands.

#### The Data Model¶

The `EncoderTicks` class defines the physical storage format.

```
import pyarrow as pa
from mosaicolabs import HeaderMixin, Serializable

class EncoderTicks(Serializable, HeaderMixin):
    # --- Wire Schema Definition ---
    __msco_pyarrow_struct__ = pa.struct([
        pa.field("left_ticks", pa.uint32(), nullable=False),
        pa.field("right_ticks", pa.uint32(), nullable=False),
        pa.field("encoder_timestamp", pa.uint64(), nullable=False),
    ])

    # --- Pydantic Fields ---
    left_ticks: int
    right_ticks: int
    encoder_timestamp: int
```

**What is happening here?**

* **`Serializable`**: Inheriting from this class automatically registers your model in the Mosaico ecosystem, making it dispatchable to the data platform, and enables the `.Q` query proxy.
* **`HeaderMixin`**: This "injects" a standard `header` (including a timestamp and frame ID) into your model, ensuring it remains compatible with time-series analysis.
* **Schema Alignment**: The field names in the `pa.struct` **must match exactly** the names of the Python attributes.
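The registration and schema-alignment behavior described above can be sketched in plain Python. This is an illustrative, stdlib-only model of the mechanism, not the SDK's actual implementation: `SerializableSketch` and `__msco_schema_fields__` are hypothetical stand-ins (the real `Serializable` validates a full PyArrow struct and also injects the `.Q` proxy).

```python
# Illustrative sketch (stdlib only) of definition-time validation and
# auto-registration. Names are hypothetical stand-ins for the SDK internals.
import re

class SerializableSketch:
    _registry: dict = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # 1. Validate that the subclass declared its wire schema at import time.
        schema = getattr(cls, "__msco_schema_fields__", None)
        if schema is None:
            raise TypeError(f"{cls.__name__} must define __msco_schema_fields__")
        # Schema alignment: field names must match the Python annotations.
        missing = set(schema) - set(cls.__annotations__)
        if missing:
            raise TypeError(f"schema fields missing on {cls.__name__}: {missing}")
        # 2. Auto-generate a tag from the class name (CamelCase -> snake_case).
        cls.__ontology_tag__ = re.sub(r"(?<!^)(?=[A-Z])", "_", cls.__name__).lower()
        # 3. Register the class in the global types registry.
        SerializableSketch._registry[cls.__ontology_tag__] = cls

class EncoderTicks(SerializableSketch):
    __msco_schema_fields__ = ("left_ticks", "right_ticks", "encoder_timestamp")
    left_ticks: int
    right_ticks: int
    encoder_timestamp: int

print(EncoderTicks.__ontology_tag__)                    # encoder_ticks
print("encoder_ticks" in SerializableSketch._registry)  # True
```

A class whose annotations do not cover the declared schema fields fails at import time, which mirrors why schema alignment "must match exactly" in the real SDK.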
For a more in-depth explanation:

* **How-To: Customizing the Data Ontology**
* **Documentation: Data Models & Ontology**

### Step 2: Implementing the ROS Adapter (`isaac_adapters.py`)¶

A ROS bag contains raw data dictionaries that we need to translate into our custom ontology data model by using *adapters*. The `ROSAdapterBase` class provides the necessary infrastructure for this. We just need to implement the `from_dict` method, which is responsible for converting the raw ROS message dictionary into our custom ontology model.

#### The Adapter Implementation¶

```
from typing import Any

from mosaicolabs import Message
from mosaicolabs.ros_bridge import ROSMessage, ROSAdapterBase, register_adapter
from mosaicolabs.ros_bridge.adapters.helpers import _make_header, _validate_msgdata

from .isaac import EncoderTicks

@register_adapter
class EncoderTicksAdapter(ROSAdapterBase[EncoderTicks]):
    ros_msgtype = ("isaac_ros_nova_interfaces/msg/EncoderTicks",)
    __mosaico_ontology_type__ = EncoderTicks
    _REQUIRED_KEYS = ("left_ticks", "right_ticks", "encoder_timestamp")

    @classmethod
    def from_dict(cls, ros_data: dict) -> EncoderTicks:
        """
        Convert a ROS message dictionary to an EncoderTicks object.
        """
        _validate_msgdata(cls, ros_data)
        return EncoderTicks(
            header=_make_header(ros_data.get("header")),
            left_ticks=ros_data["left_ticks"],
            right_ticks=ros_data["right_ticks"],
            encoder_timestamp=ros_data["encoder_timestamp"],
        )

    @classmethod
    def translate(cls, ros_msg: ROSMessage, **kwargs: Any) -> Message:
        """
        Translates a ROS EncoderTicks message into a Mosaico Message container.
        """
        return super().translate(ros_msg, **kwargs)
```

**Key Operations:**

* **`@register_adapter`**: This decorator registers the adapter with the Mosaico ROS Bridge.
* **`ros_msgtype`**: A tuple of strings representing the ROS message types that this adapter can handle.
* **`__mosaico_ontology_type__`**: The Mosaico ontology type that this adapter can handle.
* **`_REQUIRED_KEYS`**: A tuple of strings representing the required keys in the ROS message.
This is used by the `_validate_msgdata` helper to check that the ROS message contains the required fields.
* **`from_dict`**: This is the heart of the translator. It takes a Python dictionary and maps the keys to our `EncoderTicks` ontology model.
* **`translate`**: This method is called by the `RosbagInjector` class for each message in the bag. It is responsible for converting the raw ROS message dictionary into the Mosaico message, wrapping the custom ontology model.

For a more in-depth explanation:

* **Documentation: ROS Bridge**
* **API Reference: ROS Bridge**

### Step 3: The Execution Pipeline (`ros_injection.py`)¶

The main script orchestrates the entire process in three distinct phases.

#### Phase 1: Asset Preparation¶

Before we can ingest data, we need the raw file. This phase downloads a verified dataset from NVIDIA.

```
# --- PHASE 1: Asset Preparation ---
out_bag_file = download_asset(BAGFILE_URL, ASSET_DIR)
```

#### Phase 2: High-Performance Injection¶

This is where the ROS Bridge takes over. It opens the bag, applies our custom `EncoderTicksAdapter` plus the adapters already available in the SDK, and streams the data to the server.

```
# Configure the ROS injection. This uses the 'Adaptation' philosophy to translate
# ROS types into the Mosaico Ontology.
config = ROSInjectionConfig(
    host=MOSAICO_HOST,
    port=MOSAICO_PORT,
    file_path=out_bag_file,
    sequence_name=out_bag_file.stem,  # Sequence name derived from filename
    # Some example metadata for the sequence
    metadata={
        "source_url": BAGFILE_URL,
        "ingested_via": "mosaico_example_ros_injection",
        "download_time_utc": str(downloaded_time),
    },
    log_level="INFO",
)

# Handles connection, loading, adaptation, and batching
injector = RosbagInjector(config)
injector.run()  # Starts the ingestion process
```

The **`ROSInjectionConfig`** defines where the server is and what metadata to attach to the sequence.
The **`injector.run()`** method handles the heavy lifting—file loading, message adaptation, and network batching—automatically. The **`RosbagInjector`** uses the SDK features to connect to the Mosaico server, orchestrating sequence and topic creation.

For a more in-depth explanation:

* **Documentation: ROS Bridge**
* **API Reference: ROS Bridge**

#### Phase 3: Verification & Retrieval¶

Once the upload is finished, we connect to the Mosaico Server to retrieve the data from the sequence just created.

```
with MosaicoClient.connect(host=MOSAICO_HOST, port=MOSAICO_PORT) as client:
    # Ask for a SequenceHandler for the sequence we just created.
    # The sequence is identified by its name, which is the stem of the bagfile.
    shandler = client.sequence_handler(out_bag_file.stem)

    # Print some information about the sequence
    print(f"Sequence Name: {shandler.name}")
    print(f"Topics Found: {len(shandler.topics)}")
    # ...
```

**Operations:**

* **`MosaicoClient.connect()`**: Establishes a secure connection to the platform.
* **`MosaicoClient.sequence_handler()`**: Retrieves a specialized object used to manage and query the specific recording we just uploaded.

For a more in-depth explanation:

* **Documentation: The Reading Workflow**
* **API Reference: Data Retrieval**

## The full example code¶

The full example code is available under `mosaico-sdk-py/src/examples/ros_injection/main.py`.

---

API Reference: `mosaicolabs.comm.MosaicoClient`.

The `MosaicoClient` is a resource manager designed to orchestrate three distinct **Layers** of communication and processing. This layered architecture ensures that high-throughput sensor data does not block critical control operations or application logic.

## Control Layer¶

A single, dedicated connection is maintained for metadata operations. This layer handles lightweight tasks such as creating sequences, querying the catalog, and managing schema definitions.
By isolating control traffic, the client ensures that critical commands (like `sequence_finalize`) are never queued behind heavy data transfers.

## Data Layer¶

For high-bandwidth data ingestion (e.g., uploading 4x 1080p cameras simultaneously), the client maintains a **Connection Pool** of multiple Flight clients. The SDK automatically stripes writes across these connections in a round-robin fashion, allowing the application to saturate the available network bandwidth.

## Processing Layer¶

Serialization of complex sensor data (like compressing images or encoding LIDAR point clouds) is CPU-intensive. The SDK uses an **Executor Pool** of background threads to offload these tasks. This ensures that while one thread is serializing the *next* batch of data, another thread is already transmitting the *previous* batch over the network.

**Best Practice:** It is recommended to always use the client inside a `with` context to ensure resources in all layers are cleanly released.

```
from mosaicolabs import MosaicoClient

with MosaicoClient.connect("localhost", 6726) as client:
    # Logic goes here
    pass
# Pools and connections are closed automatically
```

---

The **Mosaico Data Ontology** is the semantic backbone of the SDK. It defines the structural "rules" that transform raw binary streams into meaningful physical data, such as GPS coordinates, inertial measurements, or camera frames. By using a strongly-typed ontology, Mosaico ensures that your data remains consistent, validatable, and highly optimized for both high-throughput transport and complex queries.

## Core Philosophy¶

The ontology is designed to solve the "generic data" problem in robotics by ensuring every data object is:

1. **Validatable**: Uses Pydantic for strict runtime type checking of sensor fields.
2. **Serializable**: Automatically maps Python objects to efficient **PyArrow** schemas for high-speed binary transport.
3. **Queryable**: Injects a fluent API (`.Q`) into every class, allowing you to filter databases based on physical values (e.g., `IMU.Q.acceleration.x > 6.0`).
4. **Middleware-Agnostic**: Acts as an abstraction layer so that your analysis code doesn't care if the data originally came from ROS, a simulator, or a custom logger.

## Available Ontology Classes¶

The Mosaico SDK provides a comprehensive library of models that transform raw binary streams into validated, queryable Python objects. These are grouped by their physical and logical application below.

### Base Data Models¶

API Reference: Base Data Types

These models serve as timestamped, metadata-aware wrappers for standard primitives. They allow simple diagnostic or scalar values to be treated as first-class members of the platform.

| Module | Classes | Purpose |
| --- | --- | --- |
| **Primitives** | `String`, `LargeString` | UTF-8 text data for logs or status messages. |
| **Booleans** | `Boolean` | Logic flags (True/False). |
| **Signed Integers** | `Integer8`, `Integer16`, `Integer32`, `Integer64` | Signed whole numbers of varying bit-depth. |
| **Unsigned Integers** | `Unsigned8`, `Unsigned16`, `Unsigned32`, `Unsigned64` | Non-negative integers for counters or IDs. |
| **Floating Point** | `Floating16`, `Floating32`, `Floating64` | Real numbers for high-precision physical values. |

### Geometry & Kinematics Models¶

API Reference: Geometry Models

These structures define spatial relationships and the movement states of objects in 2D or 3D coordinate frames.

| Module | Classes | Purpose |
| --- | --- | --- |
| **Points & Vectors** | `Vector2d/3d/4d`, `Point2d/3d` | Fundamental spatial directions and locations. |
| **Rotations** | `Quaternion` | Compact, singularity-free 3D orientation. |
| **Spatial State** | `Pose`, `Transform` | Absolute positions or relative coordinate frame shifts. |
| **Motion** | `Velocity`, `Acceleration` | Linear and angular movement rates (Twists and Accels). |
| **Aggregated State** | `MotionState` | An atomic snapshot combining Pose, Velocity, and Acceleration. |

### Sensor Models¶

API Reference: Sensor Models

High-level models representing physical hardware devices and their processed outputs.

| Module | Classes | Purpose |
| --- | --- | --- |
| **Inertial** | `IMU` | 6-DOF inertial data: linear acceleration and angular velocity. |
| **Navigation** | `GPS`, `GPSStatus`, `NMEASentence` | Geodetic fixes (WGS 84), signal quality, and raw NMEA strings. |
| **Vision** | `Image`, `CompressedImage`, `CameraInfo`, `ROI` | Raw pixels, encoded streams (JPEG/H264), calibration, and regions of interest. |
| **Environment** | `Temperature`, `Pressure`, `Range` | Thermal readings (K), pressure (Pa), and distance intervals (m). |
| **Dynamics** | `ForceTorque` | 3D force and torque vectors for load sensing. |
| **Magnetic** | `Magnetometer` | Magnetic field vectors measured in microtesla (µT). |
| **Robotics** | `RobotJoint` | States (position, velocity, effort) for index-aligned actuator arrays. |

## Architecture¶

The ontology architecture relies on three primary abstractions: the **Factory** (`Serializable`), the **Envelope** (`Message`), and the **Mixins**.

### 1. `Serializable` (The Factory)¶

API Reference: `mosaicolabs.models.Serializable`

Every data payload in Mosaico inherits from the `Serializable` class. It manages the global registry of data types and ensures that the system knows exactly how to convert a string tag like `"imu"` back into a Python class with a specific binary schema.

`Serializable` uses the `__init_subclass__` hook, which is automatically called whenever a developer defines a new subclass.

```
class MyCustomSensor(Serializable):  # <--- __init_subclass__ triggers here
    ...
```

When this happens, `Serializable` performs the following steps automatically:

1. **Validates Schema:** Checks if the subclass defined the PyArrow struct schema (`__msco_pyarrow_struct__`).
If missing, it raises an error at definition time (import time), preventing runtime failures later.
2. **Generates Tag:** If the class doesn't define `__ontology_tag__`, it auto-generates one from the class name (e.g., `MyCustomSensor` -> `"my_custom_sensor"`).
3. **Registers Class:** It adds the new class to the global types registry.
4. **Injects Query Proxy:** It dynamically adds a `.Q` attribute to the class, enabling the fluent query syntax (e.g., `MyCustomSensor.Q.voltage > 12.0`).

### 2. `Message` (The Envelope)¶

API Reference: `mosaicolabs.models.Message`

The **`Message`** class is the universal transport envelope for all data within the Mosaico platform. It acts as a wrapper that combines specific sensor data (the payload) with middleware-level metadata.

```
from mosaicolabs import Message, Time, Header, Temperature

# Use Case: Create a timestamped Temperature message with uncertainty
meas_time = Time.now()
temp_msg = Message(
    # Here the message timestamp matches the measurement time, but it can differ.
    timestamp_ns=meas_time.to_nanoseconds(),
    data=Temperature.from_celsius(
        value=57,
        header=Header(stamp=meas_time, frame_id="comp_case"),
        variance=0.03
    )
)
```

While logically a `Message` contains a `data` object (e.g., an instance of an Ontology type), physically on the wire (PyArrow/Parquet), the fields are **flattened**.

* **Logical:** `Message(timestamp_ns=123567890, data=IMU(acceleration=Vector3d(x=1.0,...)))`
* **Physical:** `Struct(timestamp_ns=123567890, acceleration, ...)`

This flattening is handled automatically by the class's internal methods. This ensures zero-overhead access to nested data during queries while maintaining a clean object-oriented API in Python.

### 3. Mixins: Headers & Uncertainty¶

Mosaico uses **Mixins** to inject standard fields across different data types, ensuring a consistent interface.
Almost every class in the ontology, from high-level sensors down to elementary data primitives like `Vector3d` or `Floating32`, inherits from two Mixin classes, which inject standard fields into data models via composition, ensuring consistency across different sensor types.

The integration of mixins into the Mosaico Data Ontology enables a flexible dual-usage pattern, **Standalone Messages** and **Embedded Fields**, which is detailed below and allows base geometric types to serve as either independent data streams or granular components of complex sensor models.

#### `HeaderMixin`¶

API Reference: `mosaicolabs.models.mixins.HeaderMixin`

Injects a standard (optional) `header` containing a sequence ID, a frame ID (e.g., `"base_link"`), and a high-precision acquisition timestamp (`stamp`).

```
class MySensor(Serializable, HeaderMixin):
    # Injects a header with stamp, frame_id, and seq fields
    ...
```

#### `CovarianceMixin`¶

API Reference: `mosaicolabs.models.mixins.CovarianceMixin`

Injects multidimensional uncertainty fields, typically used for flattened covariance matrices in sensor fusion applications.

```
class MySensor(Serializable, CovarianceMixin):
    # Injects a covariance matrix with covariance and covariance_type fields
    ...
```

#### `VarianceMixin`¶

API Reference: `mosaicolabs.models.mixins.VarianceMixin`

Injects monodimensional uncertainty fields, useful for sensors with 1-dimensional uncertain data (like `Temperature` or `Pressure`).

```
class MySensor(Serializable, VarianceMixin):
    # Injects a variance with variance and variance_type fields
    ...
```

#### Standalone Usage¶

Because elementary types (such as `Vector3d`, `String`, or `Floating32`) inherit directly from these mixins, they are "first-class" members of the ontology. You can treat them as independent, timestamped messages without needing to wrap them in a more complex container.
This is ideal for pushing processed signals, debug values, or simple sensor readings that require their own metadata and uncertainty context.

```
# Use Case: Sending a raw 3D vector as a timestamped message with uncertainty
accel_msg = Vector3d(
    x=0.0, y=0.0, z=9.81,
    header=Header(stamp=Time.now(), frame_id="base_link"),
    covariance=[0.01, 0, 0, 0, 0.01, 0, 0, 0, 0.01]  # 3x3 diagonal matrix
)
# `acc_writer` is a TopicWriter associated with the new sequence being uploaded.
acc_writer.push(message=Message(timestamp_ns=ts, data=accel_msg))  # (1)!

# Use Case: Sending a timestamped diagnostic error
error_msg = String(
    data="Waypoint-miss in navigation detected!",
    header=Header(stamp=Time.now(), frame_id="base_link")
)
# `log_writer` is another TopicWriter associated with the new sequence being uploaded.
log_writer.push(message=Message(timestamp_ns=ts, data=error_msg))
```

1. The `push` command is covered in the documentation of the Writers API Reference:
    * `mosaicolabs.handlers.SequenceWriter`
    * `mosaicolabs.handlers.TopicWriter`

#### Embedded Usage¶

When these base types are used as internal fields within a larger structure (e.g., an `IMU` or `MotionState` model), the mixins allow you to attach metadata to specific *parts* of a message. In this context, while the parent object (the `IMU`) carries a global timestamp, the individual fields (like `acceleration`) can carry their own specific **covariance** matrices. To avoid data redundancy, the internal `header` of an embedded field is typically left as `None`, as it inherits the temporal context from the parent message.
```
# Use Case: Embedding Vector3d inside a complex IMU message
imu_msg = IMU(
    # Parent Header: Defines the time and frame for the entire sensor packet
    header=Header(stamp=Time.now(), frame_id="imu_link"),
    # Embedded Field 1: Acceleration
    # Inherits global time, but specifies its own unique uncertainty
    acceleration=Vector3d(
        x=0.5, y=-0.2, z=9.8,
        covariance=[0.1, 0, 0, 0, 0.1, 0, 0, 0, 0.1]  # Specific to acceleration
    ),
    # Embedded Field 2: Angular Velocity
    # Carries a distinct covariance matrix independent of the acceleration
    angular_velocity=Vector3d(
        x=0.01, y=0.0, z=-0.01,
        covariance=[0.05, 0, 0, 0, 0.05, 0, 0, 0, 0.05]  # Specific to velocity
    )
)

# As above, `imu_writer` is another TopicWriter associated with the new sequence being uploaded.
imu_writer.push(message=Message(timestamp_ns=ts, data=imu_msg))
```

## Querying Data Ontology with the Query (`.Q`) Proxy¶

The Mosaico SDK allows you to perform deep discovery directly on the physical content of your sensor streams. Every class inheriting from `Serializable`, including standard sensors, geometric primitives, and custom user models, is automatically injected with a static **`.Q` proxy** attribute.

This proxy acts as a type-safe bridge between your Python data models and the platform's search engine, enabling you to construct complex filters using standard Python dot notation.

### How the Proxy Works¶

The `.Q` proxy recursively inspects the model’s schema to expose every queryable field path. It identifies the data type of each field and provides only the operators valid for that type (e.g., numeric comparisons for acceleration, substring matches for frame IDs).

* **Direct Field Access**: Filter based on primary values, such as `Temperature.Q.value.gt(25.0)`.
* **Nested Navigation**: Traverse complex, embedded structures. For example, in the `GPS` model, you can drill down into the status sub-field: `GPS.Q.status.satellites.geq(8)`.
* **Mixin Integration**: Fields inherited from mixins are automatically included in the proxy. This allows you to query standard metadata (from `HeaderMixin`) or uncertainty metrics (from `VarianceMixin` or `CovarianceMixin`) across any model.

### Queryability Examples¶

The following table illustrates how the proxy flattens complex hierarchies into queryable paths:

| Type Field Path | Proxy Field Path | Source Type | Queryable Type | Supported Operators |
| --- | --- | --- | --- | --- |
| `IMU.acceleration.x` | `IMU.Q.acceleration.x` | `float` | **Numeric** | `.eq()`, `.neq()`, `.lt()`, `.gt()`, `.leq()`, `.geq()`, `.in_()`, `.between()` |
| `GPS.status.hdop` | `GPS.Q.status.hdop` | `float` | **Numeric** | `.eq()`, `.neq()`, `.lt()`, `.gt()`, `.leq()`, `.geq()`, `.in_()`, `.between()` |
| `IMU.header.frame_id` | `IMU.Q.header.frame_id` | `str` | **String** | `.eq()`, `.neq()`, `.match()`, `.in_()` |
| `GPS.covariance_type` | `GPS.Q.covariance_type` | `int` | **Numeric** | `.eq()`, `.neq()`, `.lt()`, `.gt()`, `.leq()`, `.geq()`, `.in_()`, `.between()` |

### Practical Usage¶

To execute these filters, pass the expressions generated by the proxy to the `QueryOntologyCatalog` builder.
```
from mosaicolabs import MosaicoClient, IMU, GPS, QueryOntologyCatalog

with MosaicoClient.connect("localhost", 6726) as client:
    # Orchestrate a query filtering by physical thresholds AND metadata
    qresponse = client.query(
        QueryOntologyCatalog(include_timestamp_range=True)  # Ask for the start/end timestamps of occurrences
        .with_expression(IMU.Q.acceleration.z.gt(15.0))
        .with_expression(GPS.Q.status.service.eq(2))
    )

    # The server returns a QueryResponse grouped by Sequence for structured data management
    if qresponse is not None:
        for item in qresponse:
            # 'item.sequence' contains the name for the matched sequence
            print(f"Sequence: {item.sequence.name}")
            # 'item.topics' contains only the topics and time-segments
            # that satisfied the QueryOntologyCatalog criteria
            for topic in item.topics:
                # Access high-precision timestamps for the data segments found
                start, end = topic.timestamp_range.start, topic.timestamp_range.end
                print(f"  Topic: {topic.name} | Match Window: {start} to {end}")
```

For a comprehensive list of all supported operators and advanced filtering strategies (such as query chaining), see the **Full Query Documentation** and the Ontology types SDK Reference in the **API Reference**:

* Base Data Models
* Sensors Models
* Geometry Models
* Platform Models

## Customizing the Ontology¶

The Mosaico SDK is built for extensibility, allowing you to define domain-specific data structures that can be registered to the platform and live alongside standard types. Custom types are automatically validatable, serializable, and queryable once registered in the platform.

Follow these three steps to implement a compatible custom data type:

### 1. Inheritance and Mixins¶

Your custom class **must** inherit from `Serializable` to enable auto-registration, factory creation, and the queryability of the model. To align with the Mosaico ecosystem, use the following mixins:

* **`HeaderMixin`**: Required for timestamped data or sensor readings.
It injects a standard `header` (stamp, frame_id, seq), ensuring your data remains compatible with time-synchronization and coordinate-frame logic.
* **`CovarianceMixin`**: Used for data that includes measurement uncertainty, standardizing the storage of covariance matrices.

### 2. Define the Wire Schema (`__msco_pyarrow_struct__`)¶

You must define a class-level `__msco_pyarrow_struct__` using `pyarrow.struct`. This explicitly dictates how your Python object is serialized into high-performance Apache Arrow/Parquet buffers for network transmission and storage.

#### 2.1 Serialization Format Optimization¶

API Reference: `mosaicolabs.enum.SerializationFormat`

You can optimize remote server performance by overriding the `__serialization_format__` attribute. This controls how the server compresses and organizes your data.

| Format | Identifier | Use Case Recommendation |
| --- | --- | --- |
| **Default** | `"default"` | **Standard Table**: Fixed-width data with a constant number of fields. |
| **Ragged** | `"ragged"` | **Variable Length**: Best for lists, sequences, or point clouds. |
| **Image** | `"image"` | **Blobs**: Raw or compressed images requiring specialized codec handling. |

If not explicitly set, the system defaults to the `Default` format.

### 3. Define Class Fields¶

Define the Python attributes for your class using standard type hints. Note that the names of your Python class fields **must match exactly** the field names defined in your `__msco_pyarrow_struct__` schema.

### Customization Example: `EnvironmentSensor`¶

This example demonstrates a custom sensor for environmental monitoring that tracks temperature, humidity, and pressure.

```
# file: custom_ontology.py
from typing import Optional

import pyarrow as pa

from mosaicolabs.models import Serializable, HeaderMixin

class EnvironmentSensor(Serializable, HeaderMixin):
    """
    Custom sensor reading for Temperature, Humidity, and Pressure.
    """

    # --- 1. Define the Wire Schema (PyArrow Layout) ---
    __msco_pyarrow_struct__ = pa.struct(
        [
            pa.field("temperature", pa.float32(), nullable=False),
            pa.field("humidity", pa.float32(), nullable=True),
            pa.field("pressure", pa.float32(), nullable=True),
        ]
    )

    # --- 2. Define Python Fields (Must match schema exactly) ---
    temperature: float
    humidity: Optional[float] = None
    pressure: Optional[float] = None

# --- Usage Example ---
from mosaicolabs.models import Message, Header, Time

# Initialize with standard metadata
meas = EnvironmentSensor(
    header=Header(stamp=Time.now(), frame_id="lab_sensor_1"),
    temperature=23.5,
    humidity=0.45
)

# Ready for streaming or querying
# writer.push(Message(timestamp_ns=ts, data=meas))
```

Schema for defining a custom ontology model.

---

The **Data Handling** module serves as the high-performance operational core of the Mosaico SDK, providing a unified interface for moving multi-modal sensor data between local applications and the Mosaico Data Platform. Engineered to solve the "Big Data" challenges of robotics and autonomous systems, this module abstracts the complexities of network I/O, asynchronous buffering, and high-precision temporal alignment.

### Asymmetric Architecture¶

The SDK employs a specialized architecture that separates concerns into **Writers** and **Handlers**, ensuring each layer is optimized for its unique traffic pattern:

* **Ingestion (Writing)**: Designed for low-latency, high-throughput ingestion of 4K video, high-frequency IMU telemetry, and dense point clouds. It utilizes a "Multi-Lane" approach where each sensor stream operates in isolation with dedicated system resources.
* **Discovery & Retrieval (Reading)**: Architected to separate metadata-based resource discovery from high-volume data transmission. This separation allows developers to inspect sequence and topic catalogs, querying metadata and temporal bounds, before committing to a high-bandwidth data stream.
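As a rough mental model of the "Multi-Lane" buffering described above, each topic's lane accumulates records locally and ships them in size-limited batches. The sketch below is a toy illustration only (the `ToyLane` class, its threshold, and the batch list are invented stand-ins, not the SDK's internals):

```python
class ToyLane:
    """Toy per-topic lane: accumulates records, flushes in batches.

    Illustrative only -- the real SDK also supports byte-size limits
    and offloads transmission to background threads.
    """
    def __init__(self, max_records=3):
        self.max_records = max_records
        self.buffer = []
        self.flushed_batches = []  # stands in for network transmission

    def push(self, record):
        self.buffer.append(record)
        # Automated flushing: ship a batch once the record limit is hit
        if len(self.buffer) >= self.max_records:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flushed_batches.append(list(self.buffer))  # one "round-trip" per batch
            self.buffer.clear()

lane = ToyLane(max_records=3)
for i in range(7):
    lane.push(i)
lane.flush()  # final flush, like exiting the SDK's `with` block
# lane.flushed_batches -> [[0, 1, 2], [3, 4, 5], [6]]
```

Exiting the SDK's `with` block plays the role of the final `flush()` here, which is why the context manager guarantees that no buffered records are lost.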
### Memory-Efficient Data Flow¶

The Mosaico SDK is engineered to handle massive data volumes without exhausting local system resources, enabling the processing of datasets that span terabytes while maintaining a minimal and predictable memory footprint.

* **Smart Batching & Buffering**: Both reading and writing operations are executed in memory-limited batches rather than loading or sending entire sequences at once.
* **Asynchronous Processing**: The SDK offloads CPU-intensive tasks, such as image serialization and network I/O, to background threads within the `MosaicoClient`.
* **Automated Lifecycle**: In reading workflows, processed batches are automatically discarded and replaced with new data from the server. In writing workflows, buffers are automatically flushed based on configurable size or record limits.
* **Stream Persistence**: Integrated **Error Policies** allow developers to prioritize either a "clean slate" data state or "recovery" of partial data in the event of an application crash.

---

The **Writing Workflow** in Mosaico is designed for high-throughput data ingestion, ensuring that your application remains responsive even when streaming high-bandwidth sensor data like 4K video or high-frequency IMU telemetry. The architecture is built around a **"Multi-Lane"** approach, where each sensor stream operates in its own isolated lane with dedicated system resources.

### The Orchestrator: `SequenceWriter`¶

API Reference: `mosaicolabs.handlers.SequenceWriter`

The `SequenceWriter` acts as the central controller for a recording session. It manages the high-level lifecycle of the data on the server and serves as the factory for individual sensor streams.

**Key Roles:**

* **Lifecycle Management**: It handles the lifecycle of a new sequence and ensures that it is either successfully committed as immutable data or, in the event of a failure, cleaned up according to your configured `OnErrorPolicy`.
* **Resource Distribution**: The writer pulls network connections from the **Connection Pool** and background threads from the **Executor Pool**, assigning them to individual topics. This isolation prevents a slow network connection on one topic from bottlenecking the others.
* **Context Safety**: To ensure data integrity, the `SequenceWriter` must be used within a Python `with` block. This guarantees that all buffers are flushed and the sequence is closed properly, even if your application crashes.

```
from mosaicolabs import MosaicoClient, OnErrorPolicy

# Open the connection with the Mosaico Client
with MosaicoClient.connect("localhost", 6726) as client:
    # Start the Sequence Orchestrator
    with client.sequence_create(
        sequence_name="mission_log_042",
        # Custom metadata for this data sequence.
        metadata={  # (1)!
            "vehicle": {
                "vehicle_id": "veh_sim_042",
                "powertrain": "EV",
                "sensor_rig_version": "v3.2.1",
                "software_stack": {
                    "perception": "perception-5.14.0",
                    "localization": "loc-2.9.3",
                    "planning": "plan-4.1.7",
                },
            },
            "driver": {
                "driver_id": "drv_sim_017",
                "role": "validation",
                "experience_level": "senior",
            },
        },
        on_error=OnErrorPolicy.Delete,  # Default
    ) as seq_writer:
        # `seq_writer` is the writing handler of the new 'mission_log_042' sequence.
        # Data is uploaded by spawning topic writers that manage the actual data streams.
        # Remote push... see below.
        ...
```

1. The metadata fields will be queryable via the `Query` mechanism, which allows creating queries like: `Sequence.Q.user_metadata["vehicle.software_stack.planning"].match("plan-4.")`

### The Data Engine: `TopicWriter`¶

API Reference: `mosaicolabs.handlers.TopicWriter`

Once a topic is created, a `TopicWriter` is spawned to handle the actual transmission of data for that specific stream. It abstracts the underlying networking protocols, allowing you to simply "push" Python objects while it handles the heavy lifting.
**Key Roles:**

* **Smart Buffering**: Instead of sending every single message over the network, which would be highly inefficient, the `TopicWriter` accumulates records in a memory buffer.
* **Automated Flushing**: The writer automatically triggers a "flush" to the server whenever the internal buffer exceeds your configured limits, such as a maximum byte size or a specific number of records.
* **Asynchronous Serialization**: For CPU-intensive data (like encoding images), the writer can offload the serialization process to background threads, ensuring your main application loop stays fast.

```
# Continues from the code above...
# 👉 with client.sequence_create(...) as seq_writer:

# Create individual Topic Writers.
# Each writer gets its own assigned resources from the pools.
imu_writer = seq_writer.topic_create(
    topic_name="sensors/imu",  # The unique topic name
    metadata={  # The topic/sensor custom metadata
        "vendor": "inertix-dynamics",
        "model": "ixd-f100",
        "firmware_version": "1.2.0",
        "serial_number": "IMUF-9A31D72X",
        "calibrated": "false",
    },
    ontology_type=IMU,  # The ontology type stored in this topic
)

# Another individual topic writer for the GPS device
gps_writer = seq_writer.topic_create(
    topic_name="sensors/gps",  # The unique topic name
    metadata={  # The topic/sensor custom metadata
        "role": "primary_gps",
        "vendor": "satnavics",
        "model": "snx-g500",
        "firmware_version": "3.2.0",
        "serial_number": "GPS-7C1F4A9B",
        "interface": {  # (1)!
            "type": "UART",
            "baudrate": 115200,
            "protocol": "NMEA",
        },
    },
    ontology_type=GPS,  # The ontology type stored in this topic
)

# Push data - The SDK handles batching and background I/O
imu_writer.push(
    message=Message(
        timestamp_ns=1700000000000,
        data=IMU(acceleration=Vector3d(x=0, y=0, z=9.81), ...),
    )
)
gps_writer.push(
    message=Message(
        timestamp_ns=1700000000100,
        data=GPS(position=Vector3d(x=44.0123, y=10.12345, z=0), ...),
    )
)

# Exiting the block automatically flushes all topic buffers, finalizes the sequence
# on the server, and closes all connections and pools.
```

1. The metadata fields will be queryable via the `Query` mechanism, which allows creating query expressions like: `Topic.Q.user_metadata["interface.type"].eq("UART")`.

API Reference:

* `mosaicolabs.models.platform.Topic`
* `mosaicolabs.models.query.builders.QueryTopic`

### Resilient Data Ingestion & Error Management¶

Recording high-bandwidth sensor data in dynamic environments requires a tiered approach to error handling. While the Mosaico SDK provides automated recovery through **Error Policies**, these act as a "last line of defense". For robust production pipelines, you must implement **Defensive Ingestion Patterns** to prevent isolated failures from compromising your entire recording session.

### Sequence-Level Error Handling¶

API Reference: `mosaicolabs.enum.OnErrorPolicy`

Configured when a new sequence is created (the `on_error` argument shown above), these policies dictate how the server handles a sequence if an unhandled exception bubbles up to the `SequenceWriter` context manager.

#### 1. `OnErrorPolicy.Delete` (The "Clean Slate" Policy)¶

* **Behavior**: If an error occurs, the SDK sends an `ABORT` signal to the server.
* **Result**: The server immediately deletes the entire sequence and all associated topic data.
* **Best For**: CI/CD pipelines, unit testing, or "Gold Dataset" generation where partial or corrupted logs are unacceptable.
#### 2. `OnErrorPolicy.Report` (The "Recovery" Policy)¶

* **Behavior**: The SDK finalizes data that successfully reached the server and sends a `NOTIFY_CREATE` signal with error details.
* **Result**: The sequence is preserved but remains in an **unlocked (pending) state**, allowing for forensic analysis.
* **Best For**: Field tests and mission-critical logs where lead-up data is essential for debugging.

A schematic rationale for deciding between the two policies:

| Scenario | Recommended Policy | Rationale |
| --- | --- | --- |
| **Edge/Field Tests** | `OnErrorPolicy.Report` | Forensic value: "Partial data is better than no data" for crash analysis. |
| **Automated CI/CD** | `OnErrorPolicy.Delete` | Platform hygiene: Prevents cluttering the catalog with junk data from failed runs. |
| **Ground Truth Generation** | `OnErrorPolicy.Delete` | Integrity: Ensures only 100% verified, complete sequences enter the database. |

### Topic-Level Error Handling¶

Because the `SequenceWriter` cannot natively distinguish which specific topic failed within your injection script or custom processing code (such as a coordinate transformation), an unhandled exception will bubble up and trigger the global sequence-level error policy. To avoid this, catch errors locally for each topic: if a single topic failure is acceptable and the sequence can still be committed with partial data from the failing topics, wrap the topic-specific processing and pushing logic in a local `try-except` block. For an example, see the How-Tos.

Upcoming versions of the SDK will introduce native **Topic-Level Error Policies**, which will allow the user to define the error behavior directly when creating the topic, removing the need for boilerplate `try-except` blocks around every sensor stream.

---

The **Reading Workflow** in Mosaico is architected to separate resource discovery from high-volume data transmission.
This is achieved through two distinct layers: **Handlers**, which serve as metadata proxies, and **Streamers**, which act as the high-performance data engines.

### Handlers: The Catalog Layer¶

Handlers are lightweight objects that represent a server-side resource. Their primary role is to provide immediate access to system information and user-defined metadata **without downloading the actual sensor data**. They act as the "Catalog" layer of the SDK, allowing you to inspect the contents of the platform before committing to a high-bandwidth data stream.

Mosaico provides two specialized handler types: `SequenceHandler` and `TopicHandler`.

#### `SequenceHandler`¶

API Reference: `mosaicolabs.handlers.SequenceHandler`

Represents a complete recording session. It provides a holistic view, allowing you to inspect all available topic names, global sequence metadata, and the overall temporal bounds (earliest and latest timestamps) of the session.

This example demonstrates how to use a Sequence handler to inspect metadata.

```
from mosaicolabs import MosaicoClient

with MosaicoClient.connect("localhost", 6726) as client:
    # Use a Handler to inspect the catalog
    seq_handler = client.sequence_handler("mission_alpha")
    if seq_handler:
        print(f"Sequence: {seq_handler.name}")
        print(f"\t| Topics: {seq_handler.topics}")
        print(f"\t| User metadata: {seq_handler.user_metadata}")
        print(f"\t| Timestamp span: {seq_handler.timestamp_ns_min} - {seq_handler.timestamp_ns_max}")
        print(f"\t| Created {seq_handler.sequence_info.created_datetime}")
        print(f"\t| Size (MB) {seq_handler.sequence_info.total_size_bytes / (1024 * 1024)}")
        # Once done, close the reading channel (recommended)
        seq_handler.close()
```

#### `TopicHandler`¶

API Reference: `mosaicolabs.handlers.TopicHandler`

Represents a specific data channel within a sequence (e.g., a single IMU or Camera). It provides granular system info, such as the specific ontology model used and the data volume of that individual stream.
This example demonstrates how to use a Topic handler to inspect metadata.

```
from mosaicolabs import MosaicoClient

with MosaicoClient.connect("localhost", 6726) as client:
    # Use a Handler to inspect the catalog
    top_handler = client.topic_handler("mission_alpha", "/front/imu")
    # Note that the same handler can be retrieved via the SequenceHandler of the parent sequence:
    # seq_handler = client.sequence_handler("mission_alpha")
    # top_handler = seq_handler.get_topic_handler("/front/imu")
    if top_handler:
        print(f"Sequence:Topic: {top_handler.sequence_name}:{top_handler.name}")
        print(f"\t| User metadata: {top_handler.user_metadata}")
        print(f"\t| Timestamp span: {top_handler.timestamp_ns_min} - {top_handler.timestamp_ns_max}")
        print(f"\t| Created {top_handler.topic_info.created_datetime}")
        print(f"\t| Size (MB) {top_handler.topic_info.total_size_bytes / (1024 * 1024)}")
        # Once done, close the reading channel (recommended)
        top_handler.close()
```

### Streamers: The Data Engines¶

Both handlers serve as **factories**: once you have identified the resource you need, the handler is used to spawn the appropriate Streamer to begin data consumption. Streamers are the active components that manage the physical data exchange between the server and your application. They handle the complexities of network buffering, batch management, and the de-serialization of raw bytes into Mosaico `Message` objects.

#### `SequenceDataStreamer` (Unified Replay)¶

API Reference: `mosaicolabs.handlers.SequenceDataStreamer`

The **`SequenceDataStreamer`** is a unified engine designed specifically for sensor fusion and full-system replay. It allows you to consume multiple data streams as if they were a single, coherent timeline. To achieve this, the streamer employs the following technical mechanisms:

* **K-Way Merge Sorting**: The streamer monitors the timestamps across all requested topics simultaneously. On every iteration, it "peeks" at the next available message from each topic and yields the one with the lowest timestamp.
* **Strict Chronological Order**: This sorting ensures that messages are delivered in exact acquisition order, effectively normalizing topics that may operate at vastly different frequencies (e.g., high-rate IMU vs. low-rate GPS).
* **Temporal Slicing**: You can request a "windowed" extraction by specifying `start_timestamp_ns` and `end_timestamp_ns`. This is highly efficient as it avoids downloading the entire sequence, focusing only on the specific event or time range of interest.
* **Smart Buffering**: To maintain memory efficiency, the streamer retrieves data in memory-limited batches. As you iterate, processed batches are discarded and replaced with new data from the server, allowing you to stream sequences that exceed your available RAM.

This example demonstrates how to initiate and use the Sequence data stream.

```
from mosaicolabs import MosaicoClient

with MosaicoClient.connect("localhost", 6726) as client:
    # Use a Handler to inspect the catalog
    seq_handler = client.sequence_handler("mission_alpha")
    if seq_handler:
        # Start a Unified Stream (K-Way Merge) for multi-sensor replay.
        # We only want GPS and IMU data for this synchronized analysis.
        streamer = seq_handler.get_data_streamer(
            topics=["/gps", "/imu"],  # Optionally filter topics
            # Optionally set the time window to extract
            start_timestamp_ns=1738508778000000000,
            end_timestamp_ns=1738509618000000000
        )
        # Check the start message timestamp
        print(f"Recording starts at: {streamer.next_timestamp()}")
        for topic, msg in streamer:
            # Processes GPS and IMU in perfect chronological order
            print(f"[{topic}] at {msg.timestamp_ns}: {type(msg.data).__name__}")
        # Once done, close the reading channel (recommended)
        seq_handler.close()
```

#### `TopicDataStreamer` (Targeted Access)¶

API Reference: `mosaicolabs.handlers.TopicDataStreamer`
The **`TopicDataStreamer`** provides a dedicated, high-throughput channel for interacting with a single data resource. By bypassing the complex synchronization logic required for merging multiple topics, it offers the lowest possible overhead for tasks requiring isolated data streams, such as training models on specific camera frames or IMU logs. To ensure efficiency, the streamer supports the following features:

* **Temporal Slicing**: Much like the `SequenceDataStreamer`, you can extract data in a time-windowed fashion by specifying `start_timestamp_ns` and `end_timestamp_ns`. This ensures that only the relevant portion of the stream is retrieved rather than the entire dataset.
* **Smart Buffering**: Data is not downloaded all at once; instead, the SDK retrieves information in memory-limited batches, substituting old data with new batches as you iterate to maintain a constant, minimal memory footprint.

This example demonstrates how to initiate and use the Topic data stream.

```
from mosaicolabs import MosaicoClient, IMU

with MosaicoClient.connect("localhost", 6726) as client:
    # Retrieve the topic handler using (e.g.) MosaicoClient
    top_handler = client.topic_handler("mission_alpha", "/front/imu")
    if top_handler:
        # Start a Targeted Stream for single-sensor replay
        imu_stream = top_handler.get_data_streamer(
            # Optionally set the time window to extract
            start_timestamp_ns=1738508778000000000,
            end_timestamp_ns=1738509618000000000
        )
        # Peek at the start time
        print(f"Recording starts at: {imu_stream.next_timestamp()}")
        # Direct, low-overhead loop
        for imu_msg in imu_stream:
            process_sample(imu_msg.get_data(IMU))  # Some custom process function
        # Once done, close the reading channel (recommended)
        top_handler.close()
```

---

The **Query Module** provides a high-performance, **fluent** interface for discovering and filtering data within the Mosaico Data Platform.
It is designed to move beyond simple keyword searches, allowing you to perform deep, semantic queries across metadata, system catalogs, and the physical content of sensor streams. A typical query workflow involves chaining methods within specialized builders to create a unified request that the server executes atomically.

In the example below, the code orchestrates a multi-domain search to isolate high-interest data segments. Specifically, it queries for:

* **Sequence Discovery**: Finds any recording session whose name contains the string `"test_drive"` **AND** where the custom user metadata indicates an `"environment.visibility"` value strictly less than 50.
* **Topic Filtering**: Restricts the search specifically to the data channel named `"/front/camera/image"`.
* **Ontology Analysis**: Performs a deep inspection of IMU sensor payloads to identify specific time segments where the **X-axis acceleration exceeds a certain threshold** while simultaneously the **Y-axis acceleration exceeds a certain threshold**.
```
from mosaicolabs import QueryOntologyCatalog, QuerySequence, QueryTopic, IMU, MosaicoClient
from mosaicolabs import Sequence  # Needed for the Sequence.Q proxy used below

# Establish a connection to the Mosaico Data Platform
with MosaicoClient.connect("localhost", 6726) as client:
    # Perform a unified server-side query across multiple domains:
    qresponse = client.query(
        # Filter Sequence-level metadata
        QuerySequence()
        .with_name_match("test_drive")  # Use convenience method for fuzzy name matching
        .with_expression(
            # Use the .Q proxy to filter the `user_metadata` field
            Sequence.Q.user_metadata["environment.visibility"].lt(50)
        ),
        # Search on topics with specific names
        QueryTopic()
        .with_name("/front/camera/image"),
        # Perform deep time-series discovery within sensor payloads
        QueryOntologyCatalog(include_timestamp_range=True)  # Request temporal bounds for matches
        .with_expression(IMU.Q.acceleration.x.gt(5.0))  # Use the .Q proxy to filter the `acceleration` field
        .with_expression(IMU.Q.acceleration.y.gt(4.0)),
    )

    # The server returns a QueryResponse grouped by Sequence for structured data management
    if qresponse is not None:
        for item in qresponse:
            # 'item.sequence' contains the name for the matched sequence
            print(f"Sequence: {item.sequence.name}")
            # 'item.topics' contains only the topics and time-segments
            # that satisfied the QueryOntologyCatalog criteria
            for topic in item.topics:
                # Access high-precision timestamps for the data segments found
                start, end = topic.timestamp_range.start, topic.timestamp_range.end
                print(f"  Topic: {topic.name} | Match Window: {start} to {end}")
```

The provided example illustrates the core architecture of the Mosaico Query DSL. To effectively use this module, it is important to understand the two primary mechanisms that drive data discovery:

* **Query Builders (Fluent Logic Collectors)**: Specialized builders like `QuerySequence`, `QueryTopic`, and `QueryOntologyCatalog` serve as containers for your search criteria.
They provide a **Fluent Interface** where you can chain two types of methods:
    + **Convenience Methods**: High-level helpers for common fields, such as `with_name()`, `with_name_match()`, or `with_created_timestamp()`.
    + **Generic `with_expression()`**: A versatile method that accepts any expression obtained via the **`.Q` proxy**, allowing you to define complex filters for nested user metadata or deep sensor payloads.
* **The `.Q` Proxy (Dynamic Model Inspection)**: Every `Serializable` model in the Mosaico ontology features a static `.Q` attribute. This proxy dynamically inspects the model's underlying schema to build dot-notated field paths and intercepts attribute access (e.g., `IMU.Q.acceleration.x`). When a terminal method is called, such as `.gt()`, `.lt()`, or `.between()`, it generates a type-safe **Atomic Expression** used by the platform to filter physical sensor data or metadata fields.

By combining these mechanisms, the Query Module delivers a robust filtering experience:

* **Multi-Domain Orchestration**: Execute searches across Sequence metadata, Topic configurations, and raw Ontology sensor data in a single, atomic request.
* **Structured Response Management**: Results are returned in a `QueryResponse` that is automatically grouped by `Sequence`, making it easier to manage multi-sensor datasets.

## Query Execution & The Response Model¶

Queries are executed via the `query()` method exposed by the `MosaicoClient` class. When multiple builders are provided, they are combined with a logical **AND**.

| Method | Return | Description |
| --- | --- | --- |
| `query(*queries, query)` | `Optional[QueryResponse]` | Executes one or more queries against the platform catalogs. The provided queries are joined in an AND condition. The method accepts variable arguments of query builder objects or a pre-constructed `Query` object. |

The query execution returns a `QueryResponse` object, which behaves like a standard Python list containing `QueryResponseItem` objects.
| Class | Description |
| --- | --- |
| `QueryResponseItem` | Groups all matches belonging to the same **Sequence**. Contains a `QueryResponseItemSequence` and a list of related `QueryResponseItemTopic`. |
| `QueryResponseItemSequence` | Represents a specific **Sequence** where matches were found. It includes the sequence name. |
| `QueryResponseItemTopic` | Represents a specific **Topic** where matches were found. It includes the normalized topic path and the optional `timestamp_range` (the first and last occurrence of the condition). |

```
from mosaicolabs import MosaicoClient, QueryOntologyCatalog
from mosaicolabs.models.sensors import IMU

# Establish a connection to the Mosaico Data Platform
with MosaicoClient.connect("localhost", 6726) as client:
    # Define a Deep Data Filter using the .Q Query Proxy
    # We are searching for vertical impact events where acceleration.z > 15.0 m/s^2
    impact_qbuilder = QueryOntologyCatalog(
        IMU.Q.acceleration.z.gt(15.0),
        # include_timestamp_range returns the precise start/end of the matching event
        include_timestamp_range=True
    )

    # Execute the query via the client
    results = client.query(impact_qbuilder)

    # The same can be obtained by using the Query object
    # (requires importing Query from mosaicolabs):
    # results = client.query(
    #     query=Query(
    #         impact_qbuilder
    #     )
    # )

    if results is not None:
        # Parse the structured QueryResponse object
        # Results are automatically grouped by Sequence for easier data management
        for item in results:
            print(f"Sequence: {item.sequence.name}")
            # Iterate through matching topics within the sequence
            for topic in item.topics:
                # Topic names are normalized (sequence prefix is stripped) for direct use
                print(f" - Match in: {topic.name}")
                # Extract the temporal bounds of the event
                if topic.timestamp_range:
                    start = topic.timestamp_range.start
                    end = topic.timestamp_range.end
                    print(f"   Occurrence: {start} ns to {end} ns")
```

* **Temporal Windows**: The `timestamp_range` provides the first and last occurrence of the queried condition within a topic, allowing
you to slice data accurately for further analysis.
* **Result Normalization**: `topic.name` returns the relative topic path (e.g., `/sensors/imu`), making it immediately compatible with other SDK methods like `topic_handler()`.

### Restricted Queries (Chaining)¶

The `QueryResponse` class enables a powerful mechanism for **iterative search refinement** by allowing you to convert your current results back into a new query builder. This approach is essential for resolving complex, multi-modal dependencies where a single monolithic query would be logically ambiguous, inefficient, or technically impossible.

| Method | Return Type | Description |
| --- | --- | --- |
| `to_query_sequence()` | `QuerySequence` | Returns a query builder pre-filtered to include only the **sequences** present in the response. |
| `to_query_topic()` | `QueryTopic` | Returns a query builder pre-filtered to include only the specific **topics** identified in the response. |

When you invoke these factory methods, the SDK generates a new query expression containing an explicit `$in` filter populated with the identifiers held in the current response. This effectively **"locks" the search domain**, allowing you to apply new criteria to a restricted subset of your data without re-scanning the entire platform catalog.

```
from mosaicolabs import MosaicoClient, QueryTopic, QueryOntologyCatalog, GPS, String

with MosaicoClient.connect("localhost", 6726) as client:
    # Broad Search: Find all sequences where a GPS sensor reached a high-precision state (status=2)
    initial_response = client.query(
        QueryOntologyCatalog(GPS.Q.status.status.eq(2))
    )

    # 'initial_response' now acts as a filtered container of matching sequences.

    # Domain Locking: Restrict the search scope to the results of the initial query
    if not initial_response.is_empty():
        # .to_query_sequence() generates a QuerySequence pre-filled with the matching sequence names
        refined_query_builder = initial_response.to_query_sequence()

        # Targeted Refinement: Search for error patterns ONLY within the restricted domain.
        # This ensures the platform only scans for '[ERR]' strings within sequences
        # already validated for GPS precision.
        final_response = client.query(
            refined_query_builder,                               # The "locked" sequence domain
            QueryTopic().with_name("/localization/log_string"),  # Target a specific log topic
            QueryOntologyCatalog(String.Q.data.match("[ERR]"))   # Filter by data content pattern
        )
```

When a specific set of topics has been identified through a data-driven query (e.g., finding every camera topic that recorded a specific event), you can use `to_query_topic()` to "lock" your next search to those specific data channels. This is particularly useful when you need to verify a condition on a very specific subset of sensors across many sequences, bypassing the need to re-identify those topics in the next step.

In the next example, we first find all topics on a specific channel within sequences matching a name pattern, and then search within *those* topics for any instances where the data content matches a specific pattern.
```
from mosaicolabs import (
    MosaicoClient,
    QuerySequence,
    QueryTopic,
    QueryOntologyCatalog,
    String,
)

with MosaicoClient.connect("localhost", 6726) as client:
    # Broad Search: Find a specific log topic within sequences matching a name pattern
    initial_response = client.query(
        QueryTopic().with_name("/localization/log_string"),   # Target a specific log topic
        QuerySequence().with_name_match("test_winter_2025_")  # Filter by sequence name pattern
    )

    # Chaining: Use the results to "lock" the domain and find specific log patterns in those topics
    if not initial_response.is_empty():
        final_response = client.query(
            initial_response.to_query_topic(),                  # The "locked" topic domain
            QueryOntologyCatalog(String.Q.data.match("[ERR]"))  # Filter by content
        )
```

#### When Chaining is Necessary¶

The previous example of the `GPS.status` query followed by the `/localization/log_string` topic search highlights exactly when *query chaining* becomes a technical necessity rather than just a recommendation.

In the Mosaico Data Platform, a single `client.query()` call applies a logical **AND** across all provided builders to locate individual **data streams (topics)** that satisfy every condition simultaneously. Because a single topic cannot physically represent two different sensor types at once, such as being both a `GPS` sensor and a `String` log, a monolithic query attempting to filter for both on the same stream will inherently return zero results. Chaining resolves this by allowing you to find the correct **Sequence** context in step one, then "locking" that domain to find a different **Topic** within that same context in step two.
```
# AMBIGUOUS: This looks for ONE topic that is BOTH GPS and String
response = client.query(
    QueryOntologyCatalog(GPS.Q.status.status.eq(DGPS_FIX)),
    QueryOntologyCatalog(String.Q.data.match("[ERR]")),
    QueryTopic().with_name("/localization/log_string")
)
```

## Architecture¶

### Query Layers¶

Mosaico organizes data into three distinct architectural layers, each with its own specialized Query Builder:

#### `QuerySequence` (Sequence Layer)¶

API Reference: `mosaicolabs.models.query.builders.QuerySequence`.

Filters recordings based on high-level session metadata, such as the sequence name or the time it was created.

**Example** Querying for sequences by name and creation date

```
from mosaicolabs import MosaicoClient, QuerySequence, Sequence, Time

with MosaicoClient.connect("localhost", 6726) as client:
    # Search for sequences by project name and creation date
    qresponse = client.query(
        QuerySequence()
        .with_name_match("test_drive")
        .with_expression(Sequence.Q.user_metadata["project"].eq("Apollo"))
        .with_created_timestamp(time_start=Time.from_float(1690000000.0))
    )

    # Inspect the response
    if qresponse is not None:
        for item in qresponse:
            print(f"Sequence: {item.sequence.name}")
            print(f"Topics: {[topic.name for topic in item.topics]}")
```

#### `QueryTopic` (Topic Layer)¶

API Reference: `mosaicolabs.models.query.builders.QueryTopic`.

Targets specific data channels within a sequence. You can search for topics by name pattern or by their specific Ontology type (e.g., "Find all GPS topics").
**Example** Querying for image topics by ontology tag, metadata key, and topic creation timestamp

```
from mosaicolabs import MosaicoClient, Image, Topic, QueryTopic, Time

with MosaicoClient.connect("localhost", 6726) as client:
    # Query for all 'image' topics created in a specific timeframe,
    # matching a specific metadata (key, value) pair
    qresponse = client.query(
        QueryTopic()
        .with_ontology_tag(Image.ontology_tag())
        .with_created_timestamp(time_start=Time.from_float(1700000000))
        .with_expression(Topic.Q.user_metadata["camera_id.serial_number"].eq("ABC123_XYZ"))
    )

    # Inspect the response
    if qresponse is not None:
        # Results are automatically grouped by Sequence for easier data management
        for item in qresponse:
            print(f"Sequence: {item.sequence.name}")
            print(f"Topics: {[topic.name for topic in item.topics]}")
```

#### `QueryOntologyCatalog` (Ontology Catalog Layer)¶

API Reference: `mosaicolabs.models.query.builders.QueryOntologyCatalog`.

Filters based on the **actual time-series content** of the sensors (e.g., "Find events where `acceleration.z` exceeded a specific value").
**Example** Querying for mixed sensor data

```
from mosaicolabs import (
    MosaicoClient,
    QueryOntologyCatalog,
    GPS,
    IMU,
    Temperature,
    Pressure,
    Pose,
)

with MosaicoClient.connect("localhost", 6726) as client:
    # Chain multiple sensor filters together
    qresponse = client.query(
        QueryOntologyCatalog()
        .with_expression(GPS.Q.status.satellites.geq(8))
        .with_expression(Temperature.Q.value.between([273.15, 373.15]))
        .with_expression(Pressure.Q.value.geq(100000))
    )

    # Inspect the response
    if qresponse is not None:
        # Results are automatically grouped by Sequence for easier data management
        for item in qresponse:
            print(f"Sequence: {item.sequence.name}")
            print(f"Topics: {[topic.name for topic in item.topics]}")

    # Filter for a specific component value and extract the first and last occurrence times
    qresponse = client.query(
        QueryOntologyCatalog(include_timestamp_range=True)
        .with_expression(IMU.Q.acceleration.x.lt(-4.0))
        .with_expression(IMU.Q.acceleration.y.gt(5.0))
        .with_expression(Pose.Q.rotation.z.geq(0.707))
    )

    # Inspect the response
    if qresponse is not None:
        # Results are automatically grouped by Sequence for easier data management
        for item in qresponse:
            print(f"Sequence: {item.sequence.name}")
            print(f"Topics: { {topic.name: [topic.timestamp_range.start, topic.timestamp_range.end] for topic in item.topics} }")
```

The Mosaico Query Module offers two distinct paths for defining filters, **Convenience Methods** and the **Generic Expression Method**, both of which support **method chaining** to compose multiple criteria into a single query using a logical **AND**.

#### Convenience Methods¶

The query layers provide high-level fluent helpers (the `with_*` methods), built directly into the query builder classes and designed for ease of use. They allow you to filter data without deep knowledge of the internal model schema. The builder automatically selects the appropriate field and operator (such as exact match vs. substring pattern) based on the method used.
```
from mosaicolabs import MosaicoClient, QuerySequence, QueryTopic, RobotJoint

with MosaicoClient.connect("localhost", 6726) as client:
    # Build a filter with a name pattern
    qbuilder = QuerySequence().with_name_match("test_drive")

    # Execute the query
    qresponse = client.query(qbuilder)

    # Inspect the response
    if qresponse is not None:
        # Results are automatically grouped by Sequence for easier data management
        for item in qresponse:
            print(f"Sequence: {item.sequence.name}")
            print(f"Topics: {[topic.name for topic in item.topics]}")

    # Build a filter with an ontology tag AND a specific creation time window
    # (t1 and t2 are Time instances defined elsewhere)
    qbuilder = (
        QueryTopic()
        .with_ontology_tag(RobotJoint.ontology_tag())
        .with_created_timestamp(time_start=t1, time_end=t2)
    )

    # Execute the query
    qresponse = client.query(qbuilder)

    # Inspect the response
    if qresponse is not None:
        # Results are automatically grouped by Sequence for easier data management
        for item in qresponse:
            print(f"Sequence: {item.sequence.name}")
            print(f"Topics: {[topic.name for topic in item.topics]}")
```

* **Best For**: Standard system-level fields like names and timestamps.

#### Generic Expression Method¶

The `with_expression()` method accepts raw **Query Expressions** generated through the `.Q` proxy. This provides full access to every supported operator (`.gt()`, `.lt()`, `.between()`, etc.) for specific fields.
```
from mosaicolabs import (
    MosaicoClient,
    QueryOntologyCatalog,
    QuerySequence,
    Sequence,
    IMU,
)

with MosaicoClient.connect("localhost", 6726) as client:
    # Build a filter with a name pattern and a metadata-related expression
    qbuilder = (
        QuerySequence()
        .with_expression(
            # Use the query proxy to generate a QueryExpression
            Sequence.Q.user_metadata["environment.visibility"].lt(50)
        )
        # Can be AND-chained with convenience methods
        .with_name_match("test_drive")
    )

    # Execute the query
    qresponse = client.query(qbuilder)

    # Inspect the response
    if qresponse is not None:
        # Results are automatically grouped by Sequence for easier data management
        for item in qresponse:
            print(f"Sequence: {item.sequence.name}")
            print(f"Topics: {[topic.name for topic in item.topics]}")

    # Build a filter with deep time-series data discovery and measurement time windowing
    qbuilder = (
        QueryOntologyCatalog()
        .with_expression(IMU.Q.acceleration.x.gt(5.0))
        .with_expression(IMU.Q.header.stamp.sec.gt(1700134567))
        .with_expression(IMU.Q.header.stamp.nanosec.between([123456, 789123]))
    )

    # Execute the query
    qresponse = client.query(qbuilder)

    # Inspect the response
    if qresponse is not None:
        # Results are automatically grouped by Sequence for easier data management
        for item in qresponse:
            print(f"Sequence: {item.sequence.name}")
            print(f"Topics: {[topic.name for topic in item.topics]}")
```

* **Best For**: Accessing specific Ontology data fields (e.g., acceleration, position, etc.) and custom `user_metadata` in the `Sequence` and `Topic` data models.

### The `.Q` Proxy Mechanism¶

The Query Proxy is the cornerstone of Mosaico's type-safe data discovery. Every data model in the Mosaico Ontology (e.g., `IMU`, `GPS`, `Image`) is automatically injected with a static `.Q` attribute during class initialization. This mechanism transforms static data structures into dynamic, fluent interfaces for constructing complex filters.

The proxy follows a three-step lifecycle to ensure that your queries are both semantically correct and high-performance:

1.
**Intelligent Mapping**: During system initialization, the proxy inspects the sensor's schema recursively. It maps every nested field path (e.g., `"acceleration.x"`) to a dedicated *queryable* object, i.e., an object providing comparison operators and expression-generation methods.
2. **Type-Aware Operators**: The proxy identifies the data type of each field (numeric, string, dictionary, or boolean) and exposes only the operators valid for that type. This prevents logical errors, such as attempting a substring `.match()` on a numeric acceleration value.
3. **Intent Generation**: When you invoke an operator (e.g., `.gt(15.0)`), the proxy generates a `QueryExpression`. This object encapsulates your search intent and is serialized into an optimized JSON format for the platform to execute.

To understand how the proxy handles nested structures, inherited attributes, and data types, consider the `IMU` ontology class:

```
class IMU(Serializable, HeaderMixin):
    acceleration: Vector3d                    # Composed type: contains x, y, z
    angular_velocity: Vector3d                # Composed type: contains x, y, z
    orientation: Optional[Quaternion] = None  # Composed type: contains x, y, z, w
```

The `.Q` proxy enables you to navigate the data exactly as it is defined in the model. Starting from `IMU.Q`, you can drill down through nested fields and inherited mixins using standard dot notation until you reach a base queryable type.
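To build intuition for how such a proxy can behave, the following self-contained sketch intercepts attribute access to accumulate a dot-notated field path and emits an atomic expression when a terminal operator is called. It is illustrative only: the class names (`FieldProxy`, `QueryExpression`) and the `$`-prefixed operator codes are assumptions, not the actual mosaicolabs internals.

```python
class QueryExpression:
    """An atomic filter: a dot-notated field path, an operator, and a value."""
    def __init__(self, path: str, op: str, value):
        self.path, self.op, self.value = path, op, value

    def to_dict(self) -> dict:
        # Serializable search intent, analogous to the JSON sent to the platform
        return {"field": self.path, "op": self.op, "value": self.value}


class FieldProxy:
    """Builds dot-notated paths on attribute access; terminal methods
    (.gt, .lt, .eq) emit a QueryExpression."""
    def __init__(self, path: str = ""):
        self._path = path

    def __getattr__(self, name: str) -> "FieldProxy":
        # Called only for unknown attributes, so .gt/.lt/.eq still resolve normally
        new_path = f"{self._path}.{name}" if self._path else name
        return FieldProxy(new_path)

    def gt(self, value): return QueryExpression(self._path, "$gt", value)
    def lt(self, value): return QueryExpression(self._path, "$lt", value)
    def eq(self, value): return QueryExpression(self._path, "$eq", value)


# Usage: mimics IMU.Q.acceleration.x.gt(5.0)
Q = FieldProxy()
expr = Q.acceleration.x.gt(5.0)
print(expr.to_dict())  # {'field': 'acceleration.x', 'op': '$gt', 'value': 5.0}
```

The real proxy additionally validates each step against the model's schema (step 1) and exposes only type-appropriate operators (step 2); this sketch shows only the path-building and intent-generation mechanics.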
The proxy automatically flattens the hierarchy, including fields inherited from `HeaderMixin` (like `frame_id` and `stamp`), assigning the correct queryable type and operators to each leaf node (API Reference: `mosaicolabs.models.sensors.IMU`):

| Proxy Field Path | Queryable Type | Supported Operators (Examples) |
| --- | --- | --- |
| **`IMU.Q.acceleration.x/y/z`** | **Numeric** | `.gt()`, `.lt()`, `.geq()`, `.leq()`, `.eq()`, `.between()`, `.in_()` |
| **`IMU.Q.angular_velocity.x/y/z`** | **Numeric** | `.gt()`, `.lt()`, `.geq()`, `.leq()`, `.eq()`, `.between()`, `.in_()` |
| **`IMU.Q.orientation.x/y/z/w`** | **Numeric** | `.gt()`, `.lt()`, `.geq()`, `.leq()`, `.eq()`, `.between()`, `.in_()` |
| **`IMU.Q.header.frame_id`** | **String** | `.eq()`, `.match()` |
| **`IMU.Q.header.stamp.sec`** | **Numeric** | `.gt()`, `.lt()`, `.geq()`, `.leq()`, `.eq()`, `.between()`, `.in_()` |
| **`IMU.Q.header.stamp.nanosec`** | **Numeric** | `.gt()`, `.lt()`, `.geq()`, `.leq()`, `.eq()`, `.between()`, `.in_()` |

The following table lists the supported operators for each data type:

| Data Type | Operators |
| --- | --- |
| **Numeric** | `.eq()`, `.neq()`, `.lt()`, `.leq()`, `.gt()`, `.geq()`, `.between()`, `.in_()` |
| **String** | `.eq()`, `.neq()`, `.match()` (i.e., substring), `.in_()` |
| **Boolean** | `.eq(True/False)` |
| **Dictionary** | `.eq()`, `.neq()`, `.lt()`, `.leq()`, `.gt()`, `.geq()`, `.between()`, `.in_()`, `.match()` |

#### Supported vs. Unsupported Types¶

While the `.Q` proxy is highly versatile, it enforces specific rules on which data structures can be queried:

* **Supported Types**: The proxy resolves all simple types (int, float, str, bool) and composed types (like `Vector3d` or `Quaternion`). It will continue to expose nested fields as long as they lead to a primitive base type.
* **Dictionaries**: Dynamic fields, such as the `user_metadata` found in the **`Topic`** and **`Sequence`** platform models, are fully queryable through the proxy using bracket notation (e.g., `Topic.Q.user_metadata["key"]` or `Topic.Q.user_metadata["key.subkey.subsubkey"]`). This approach provides the flexibility to search across custom tags and dynamic properties that aren't part of a fixed schema. This dictionary-based querying logic is not restricted to platform models; it applies to any **custom ontology model** created by the user that contains a `dict` field.

  + **Syntax**: Instead of the standard dot notation used for fixed fields, you must use square brackets `["key"]` to target specific dictionary entries.
  + **Nested Access**: For dictionaries containing nested structures, you can use **dot notation within the key string** (e.g., `["environment.visibility"]`) to traverse sub-fields.
  + **Operator Support**: Because dictionary values are dynamic, these fields are "promiscuous," meaning they support all available numeric, string, and boolean operators without strict SDK-level type checking.

* **Unsupported Types (Lists and Tuples)**: Any field defined as a container, such as a **List** or **Tuple** (e.g., `covariance: List[float]`), is currently skipped by the proxy generator. These fields will not appear in autocomplete and cannot be used in a query expression.

## Constraints & Limitations¶

While fully functional, the current implementation (v0.x) has a **Single Occurrence Constraint**.

* **Constraint**: A specific data field path may appear **only once** within a single query builder instance. You cannot chain two separate conditions on the same field (e.g., `.gt(0.5)` and `.lt(1.0)`).

```
# INVALID: The same field (acceleration.x) is used twice in the same builder
QueryOntologyCatalog() \
    .with_expression(IMU.Q.acceleration.x.gt(0.5)) \
    .with_expression(IMU.Q.acceleration.x.lt(1.0))  # <- Error! Duplicate field path
```

* **Solution**: Use the built-in **`.between([min, max])`** operator to perform range filtering on a single field path.
* **Note**: You can still query multiple *different* fields from the same sensor model (e.g., `acceleration.x` and `acceleration.y`) in one builder.

```
# VALID: Each expression targets a unique field path
QueryOntologyCatalog(
    IMU.Q.acceleration.x.gt(0.5),              # Unique field
    IMU.Q.acceleration.y.lt(1.0),              # Unique field
    IMU.Q.angular_velocity.x.between([0, 1]),  # Correct way to express ranges
    include_timestamp_range=True
)
```

---

The **Mosaico ML** module serves as the high-performance bridge between the Mosaico Data Platform and the modern Data Science ecosystem. While the platform is optimized for high-speed raw message streaming, this module provides the abstractions necessary to transform asynchronous sensor data into tabular formats compatible with **Physical AI**, **Deep Learning**, and **Predictive Analytics**.

Working with robotics and multi-modal datasets presents three primary technical hurdles that the ML module is designed to solve:

* **Heterogeneous Sampling**: Sensors like LIDAR (low frequency), IMU (high frequency), and GPS (intermittent) operate at different rates.
* **High Volume**: Datasets often exceed the available system RAM.
* **Nested Structures**: Robotics data is typically deeply nested, with coordinate transformations and covariance matrices.

## From Sequences to DataFrames¶

API Reference: `mosaicolabs.ml.DataFrameExtractor`

The `DataFrameExtractor` is a specialized utility designed to convert Mosaico sequences into tabular formats. Unlike standard streamers that instantiate individual Python objects, this extractor operates at the **Batch Level**, pulling raw `RecordBatch` objects directly from the underlying stream to maximize throughput.

### Key Technical Features¶

* **Recursive Flattening**: Automatically "unpacks" deeply nested Mosaico Ontology structures into primitive columns.
* **Semantic Naming**: Columns use a `{topic_name}.{ontology_tag}.{field_path}` convention (e.g., `/front/camera/imu.imu.acceleration.x`) to remain self-describing.
* **Namespace Isolation**: Topic names are included in column headers to prevent collisions when multiple sensors of the same type are present.
* **Memory-Efficient Windowing**: Uses a generator-based approach to yield data in time-based "chunks" (e.g., 5-second windows) while handling straddling batches via a carry-over buffer.
* **Sparse Merging**: Creates a "sparse" DataFrame containing the union of all timestamps, using `NaN` for missing sensor readings at specific intervals.

This example demonstrates iterating through a sequence in 10-second tabular chunks.

```
from mosaicolabs import MosaicoClient
from mosaicolabs.ml import DataFrameExtractor

with MosaicoClient.connect("localhost", 6726) as client:
    # Initialize from an existing SequenceHandler
    seq_handler = client.sequence_handler("drive_session_01")
    extractor = DataFrameExtractor(seq_handler)

    # Iterate through 10-second chunks
    for df in extractor.to_pandas_chunks(window_sec=10.0):
        # 'df' is a pandas DataFrame with semantic columns
        # Example: df["/front/camera/imu.imu.acceleration.x"]
        print(f"Processing chunk with {len(df)} rows")
```

For complex types like images that require specialized decoding, Mosaico allows you to "inflate" a flattened DataFrame row back into a strongly-typed `Message` object.
```
from mosaicolabs import MosaicoClient
from mosaicolabs.ml import DataFrameExtractor
from mosaicolabs.models import Message, Image

with MosaicoClient.connect("localhost", 6726) as client:
    # Initialize from an existing SequenceHandler
    seq_handler = client.sequence_handler("drive_session_01")
    extractor = DataFrameExtractor(seq_handler)

    # Get data chunks
    for df in extractor.to_pandas_chunks(topics=["/sensors/front/image_raw"]):
        for _, row in df.iterrows():
            # Reconstruct the full Message (envelope + payload) from a row
            img_msg = Message.from_dataframe_row(
                row=row,
                topic_name="/sensors/front/image_raw",
            )
            if img_msg:
                img = img_msg.get_data(Image).to_pillow()
                # Access typed fields with IDE autocompletion
                print(f"Time: {img_msg.timestamp_ns}")
                img.show()
```

## Sparse to Dense Representation¶

API Reference: `mosaicolabs.ml.SyncTransformer`

The `SyncTransformer` is a temporal resampler designed to solve the **Heterogeneous Sampling** problem inherent in robotics and Physical AI. It aligns multi-rate sensor streams (for example, an IMU at 100Hz and a GPS at 5Hz) onto a uniform, fixed-frequency grid to prepare them for machine learning models.

The `SyncTransformer` operates as a processor that bridges the gaps between windowed chunks yielded by the `DataFrameExtractor`. Unlike standard resamplers that treat each data batch in isolation, this transformer maintains internal state to ensure signal continuity across batch boundaries.

### Key Design Principles¶

* **Stateful Continuity**: It maintains an internal cache of the last known sensor values and the next expected grid tick, allowing signals to bridge the gap between independent DataFrame chunks.
* **Semantic Integrity**: It respects the physical reality of data acquisition by yielding `None` for grid ticks that occur before a sensor's first physical measurement, avoiding data "hallucination".
* **Vectorized Performance**: Internal kernels leverage high-speed lookups for high-throughput processing.
* **Protocol-Based Extensibility**: The mathematical logic for resampling is decoupled through a `SyncPolicy` protocol, allowing for custom kernel injection.

### Implemented Synchronization Policies¶

API Reference: `mosaicolabs.ml.SyncPolicy`

Each policy defines a specific logic for how the transformer bridges temporal gaps between sparse data points.

#### 1. **`SyncHold`** (Last-Value-Hold)¶

* **Behavior**: Finds the most recent valid measurement and "holds" it constant until a new one arrives.
* **Best For**: Sensors where states remain valid until explicitly changed, such as robot joint positions or battery levels.

#### 2. **`SyncAsOf`** (Staleness Guard)¶

* **Behavior**: Carries the last known value forward only if it has not exceeded a defined maximum "tolerance" (i.e., it is fresher than a specific age).
* **Best For**: High-speed signals that become unreliable if not updated frequently, such as localization coordinates.

#### 3. **`SyncDrop`** (Interval Filter)¶

* **Behavior**: Ensures a grid tick only receives a value if a new measurement actually occurred within that specific grid interval; otherwise, it returns `None`.
* **Best For**: Downsampling high-frequency data where a strict 1-to-1 relationship between windows and unique hardware events is required.

### Scikit-Learn Compatibility¶

By implementing the standard `fit`/`transform` interface, the `SyncTransformer` makes robotics data a "first-class citizen" of the Scikit-learn ecosystem. This allows for the plug-and-play integration of multi-rate sensor data into standard pipelines.
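The stateful, chunk-bridging contract described above can be made concrete with a small, library-independent sketch. `HoldResampler` and its API are illustrative assumptions, not the SDK's `SyncTransformer`: it implements `SyncHold`-style last-value-hold on a fixed grid, with an optional staleness `tolerance` that mimics `SyncAsOf`, and it yields `None` before the first physical measurement.

```python
class HoldResampler:
    def __init__(self, period, tolerance=None):
        self.period = period        # grid spacing, in seconds
        self.tolerance = tolerance  # max staleness in seconds (None = hold forever)
        self._last = None           # (timestamp, value) of the last sample seen
        self._next_tick = 0.0       # next expected grid tick (state carried across chunks)

    def transform(self, chunk):
        """chunk: list of (timestamp, value) pairs sorted by time.
        Returns (tick, value_or_None) for every grid tick covered so far."""
        out = []
        for ts, val in chunk:
            # Emit all grid ticks that precede this sample
            while self._next_tick <= ts:
                out.append((self._next_tick, self._held_value()))
                self._next_tick += self.period
            self._last = (ts, val)
        return out

    def _held_value(self):
        if self._last is None:
            return None  # before the first measurement: no data "hallucination"
        ts, val = self._last
        if self.tolerance is not None and self._next_tick - ts > self.tolerance:
            return None  # SyncAsOf-style behaviour: the held value is too stale
        return val


# Two chunks of sparse data resampled onto a 1.0 s grid;
# state carries across the two transform() calls
r = HoldResampler(period=1.0)
print(r.transform([(0.2, "a"), (2.1, "b")]))  # [(0.0, None), (1.0, 'a'), (2.0, 'a')]
print(r.transform([(4.3, "c")]))              # [(3.0, 'b'), (4.0, 'b')]
```

The real `SyncTransformer` layers vectorized kernels, the `SyncPolicy` protocol, and the Scikit-learn `fit`/`transform` interface on top of this basic contract.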
```
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from mosaicolabs import MosaicoClient
from mosaicolabs.ml import DataFrameExtractor, SyncTransformer, SyncHold

# Define a pipeline for Physical AI preprocessing
pipeline = Pipeline([
    ('sync', SyncTransformer(target_fps=30.0, policy=SyncHold())),
    ('scaler', StandardScaler())
])

with MosaicoClient.connect("localhost", 6726) as client:
    # Initialize from an existing SequenceHandler
    seq_handler = client.sequence_handler("drive_session_01")
    extractor = DataFrameExtractor(seq_handler)

    # Process sequential chunks while maintaining signal continuity
    # (assumes the pipeline has been fitted beforehand)
    for sparse_chunk in extractor.to_pandas_chunks(window_sec=5.0):
        # The transformer automatically carries state across sequential calls
        normalized_dense_chunk = pipeline.transform(sparse_chunk)
```

---

The **ROS Bridge** module serves as the ingestion gateway for ROS (Robot Operating System) data into the Mosaico Data Platform. Its primary function is to solve the interoperability challenges associated with ROS bag files—specifically, format fragmentation (ROS 1 `.bag` vs. ROS 2 `.mcap`/`.db3`) and the lack of strict schema enforcement in custom message definitions.

The core philosophy of the module is **"Adaptation, Not Just Parsing."** Rather than simply extracting raw dictionaries from ROS messages, the bridge actively translates them into the standardized **Mosaico Ontology**. For example, a `geometry_msgs/Pose` is validated, normalized, and instantiated as a strongly-typed `mosaicolabs.models.data.Pose` object before ingestion.

## Architecture¶

The module is composed of four distinct layers that handle the pipeline from raw file access to server transmission.

### The Loader Layer (`ROSLoader`)¶

The `ROSLoader` acts as the abstraction layer over the physical bag files. It utilizes the `rosbags` library to provide a unified interface for reading both ROS 1 and ROS 2 formats (`.bag`, `.db3`, `.mcap`).
* **Responsibilities:** File I/O, raw deserialization, and topic filtering (supporting glob patterns like `/cam/*`).
* **Error Handling:** It implements configurable policies (`IGNORE`, `LOG_WARN`, `RAISE`) to handle corrupted messages or deserialization failures without crashing the entire pipeline.

### The Adaptation Layer (`ROSBridge` & Adapters)¶

This layer represents the semantic core of the module, translating raw ROS data into the Mosaico Ontology.

* **`ROSAdapterBase`:** An abstract base class that establishes the contract for converting specific ROS message types into their corresponding Mosaico Ontology types.
* **Concrete Adapters:** The library provides built-in implementations for common standards, such as `IMUAdapter` (mapping `sensor_msgs/Imu` to `IMU`) and `ImageAdapter` (mapping `sensor_msgs/Image` to `Image`). These adapters include advanced logic for recursive unwrapping, automatically extracting data from complex nested wrappers like `PoseWithCovarianceStamped`. Developers can also implement custom adapters to handle non-standard or proprietary types.
* **`ROSBridge`:** A central registry and dispatch mechanism that maps ROS message type strings (e.g., `sensor_msgs/msg/Imu`) to their corresponding adapter classes, ensuring the correct translation logic is applied for each message.

#### Extending the Bridge (Custom Adapters)¶

Users can extend the bridge to support new ROS message types by implementing a custom adapter and registering it.

1. **Inherit from `ROSAdapterBase`**: Define the input ROS type string and the target Mosaico Ontology type.
2. **Implement `from_dict`**: Define the logic to convert the `ROSMessage.data` dictionary into an instance of the target ontology object.
3. **Register**: Decorate the class with `@register_adapter`.
```
from mosaicolabs.ros_bridge import ROSAdapterBase, register_adapter
from my_ontology import MyCustomData  # Assuming this class exists

@register_adapter
class MyCustomAdapter(ROSAdapterBase[MyCustomData]):
    ros_msgtype = "my_pkg/msg/MyCustomType"
    __mosaico_ontology_type__ = MyCustomData

    @classmethod
    def from_dict(cls, ros_data: dict) -> MyCustomData:
        # Transformation logic here
        return MyCustomData(...)
```

### The Orchestrator (`RosbagInjector`)¶

The **`RosbagInjector`** is the central command center of the ROS Bridge module. It is designed to be the primary entry point for developers who want to embed high-performance ROS ingestion directly into their Python applications or automation scripts.

The injector acts as a "glue" layer, orchestrating the interaction between the **`ROSLoader`** (file access), the **`ROSBridge`** (data adaptation), and the **`MosaicoClient`** (network transmission). It handles the complex lifecycle of a data upload—including connection management, batching, and transaction safety—while providing real-time feedback through a visual CLI interface.

#### Core Workflow Execution: `run()`¶

The `run()` method is the heart of the injector. When called, it initiates a multi-phase pipeline:

1. **Handshake & Registry**: Establishes a connection to the Mosaico server and registers any provided custom `.msg` definitions into the global `ROSTypeRegistry`.
2. **Sequence Creation**: Requests the server to initialize a new data sequence based on the provided name and metadata.
3. **Adaptive Streaming**: Iterates through the ROS bag records. For each message, it identifies the correct adapter, translates the ROS dictionary into a Mosaico object, and pushes it into an optimized asynchronous write buffer.
4. **Transaction Finalization**: Once the bag is exhausted, it flushes all remaining buffers and signals the server to commit the sequence.
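The adapter lookup performed during Adaptive Streaming (phase 3) boils down to a registry keyed by ROS type strings. The sketch below is library-independent and purely illustrative — `ADAPTERS`, `register`, `ImuAdapter`, and `ingest` are assumed names, not the actual `ROSBridge` internals — but it shows the decorator-based registration and per-message dispatch pattern.

```python
ADAPTERS = {}  # ROS message type string -> adapter class

def register(cls):
    """Decorator: map the adapter's declared ROS type to the class."""
    ADAPTERS[cls.ros_msgtype] = cls
    return cls

@register
class ImuAdapter:
    ros_msgtype = "sensor_msgs/msg/Imu"

    @classmethod
    def from_dict(cls, data: dict) -> dict:
        # A real adapter builds a typed ontology object; a plain dict stands in here
        return {"acceleration": data["linear_acceleration"]}

def ingest(records):
    """records: (ros_type, payload) pairs. Unsupported types are skipped,
    mirroring an IGNORE-style error policy."""
    out = []
    for ros_type, payload in records:
        adapter = ADAPTERS.get(ros_type)
        if adapter is not None:
            out.append(adapter.from_dict(payload))
    return out

msgs = [
    ("sensor_msgs/msg/Imu", {"linear_acceleration": {"x": 0.1, "y": 0.0, "z": 9.8}}),
    ("my_pkg/msg/Unknown", {}),  # no adapter registered: skipped
]
print(ingest(msgs))  # [{'acceleration': {'x': 0.1, 'y': 0.0, 'z': 9.8}}]
```

The real `ROSBridge` additionally applies the configured error policy and the recursive unwrapping of nested wrapper types; the silent skip above is only the simplest case.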
#### The Blueprint: `ROSInjectionConfig`¶

The behavior of the injector is entirely driven by the **`ROSInjectionConfig`**. This configuration object ensures that the ingestion logic is decoupled from the user interface, allowing for consistent behavior whether triggered via the CLI or a complex script.

| Attribute | Type | Description |
| --- | --- | --- |
| **`file_path`** | `Path` | The location of the source ROS bag (`.mcap`, `.db3`, or `.bag`). |
| **`sequence_name`** | `str` | The unique identifier for the sequence on the server. |
| **`metadata`** | `dict` | Searchable tags and context (e.g., `{"weather": "rainy"}`) attached to the sequence. |
| **`ros_distro`** | `Stores` | **Crucial for `.db3` bags:** Specifies the ROS distribution (e.g., `ROS2_HUMBLE`) to ensure standard messages are parsed with the correct schema version. |
| **`topics`** | `List[str]` | A filter list supporting glob patterns (e.g., `["/camera/*"]`). If omitted, all supported topics are ingested. |
| **`custom_msgs`** | `List` | A list of tuples `(package, path, store)` used to dynamically register proprietary message definitions at runtime. |
| **`on_error`** | `OnErrorPolicy` | **Safety Switch:** Determines whether a failed upload should `Delete` the partial sequence or `Report` the error and keep the data. |
| **`log_level`** | `str` | Controls terminal verbosity, ranging from `DEBUG` to `ERROR`. |

#### Practical Example: Programmatic Usage¶

```
from pathlib import Path
from mosaicolabs.ros_bridge import RosbagInjector, ROSInjectionConfig, Stores

def run_injection():
    # Define the Injection Configuration.
    # This data class acts as the single source for the operation.
    config = ROSInjectionConfig(
        # Input Data
        file_path=Path("data/session_01.db3"),

        # Target Platform Metadata
        sequence_name="test_ros_sequence",
        metadata={
            "driver_version": "v2.1",
            "weather": "sunny",
            "location": "test_track_A"
        },

        # Topic Filtering (supports glob patterns)
        # This will only upload topics starting with '/cam'
        topics=["/cam*"],

        # ROS Configuration
        # Specifying the distro ensures correct parsing of standard messages
        # (.db3 sqlite3 rosbags need the distro specification)
        ros_distro=Stores.ROS2_HUMBLE,

        # Custom Message Registration
        # Register proprietary messages before loading to prevent errors
        custom_msgs=[
            (
                "my_custom_pkg",                # ROS Package Name
                Path("./definitions/my_pkg/"),  # Path to directory containing .msg files
                Stores.ROS2_HUMBLE,             # Scope (valid for this distro)
            )
            # The registry infers type names automatically from the .msg filenames
        ],

        # Execution Settings
        log_level="WARNING",  # Reduce verbosity for automated scripts
    )

    # Instantiate the Controller
    injector = RosbagInjector(config)

    # Execute.
    # The run method handles connection, loading, and uploading automatically.
    # It raises exceptions for fatal errors, allowing you to wrap it in try/except blocks.
    try:
        injector.run()
        print("Injection job completed successfully.")
    except Exception as e:
        print(f"Injection job failed: {e}")

# Use as a script or call the injection function in your code
if __name__ == "__main__":
    run_injection()
```

#### CLI Usage¶

The module includes a command-line interface for quick ingestion tasks.
The full list of options can be retrieved by running `mosaico.ros_injector -h`.

```
# Basic Usage
poetry run mosaico.ros_injector ./data.mcap --name "Test_Run_01"

# Advanced Usage: Filtering topics and adding metadata
poetry run mosaico.ros_injector ./data.db3 \
    --name "Test_Run_01" \
    --topics /camera/front/* /gps/fix \
    --metadata ./metadata.json \
    --ros-distro ros2_humble
```

### The Type Registry (`ROSTypeRegistry`)¶

The **`ROSTypeRegistry`** is a context-aware singleton designed to manage the schemas required to decode ROS data. ROS message definitions are frequently external to the data files themselves; this is especially true for ROS 2 `.db3` (SQLite) formats and proprietary datasets containing custom sensors. Without these definitions, the bridge cannot deserialize the raw binary "blobs" into readable dictionaries.

* **Schema Resolution**: It allows the `ROSLoader` to resolve custom `.msg` definitions on-the-fly during bag playback.
* **Version Isolation (Stores)**: ROS messages often vary across distributions (e.g., a "Header" in ROS 1 Noetic is structurally different from ROS 2 Humble). The registry uses a "Profile" system to store these version-specific definitions separately, preventing cross-distribution conflicts.
* **Global vs. Scoped Definitions**: You can register definitions **Globally** (available to all loaders) or **Scoped** to a specific distribution.

#### Pre-loading Definitions¶

While you can pass custom messages via `ROSInjectionConfig`, this can become cumbersome for large-scale projects with hundreds of proprietary types. The recommended approach is to pre-load the registry at the start of your application. This makes the definitions available to all subsequent loaders automatically.

| Method | Scope | Description |
| --- | --- | --- |
| **`register(...)`** | Single Message | Registers a single custom type. The source can be a path to a `.msg` file or a raw string containing the definition. |
| **`register_directory(...)`** | Batch Package | Scans a directory for all `.msg` files and registers them under a specific package name (e.g., `my_pkg/msg/Sensor`). |
| **`get_types(...)`** | Internal | Implements a "Cascade" logic: merges Global definitions with distribution-specific overrides for a loader. |
| **`reset()`** | Utility | Clears all stored definitions. Primarily used for unit testing to ensure process isolation. |

#### Centralized Registration Example¶

A clean way to manage large projects is to centralize your message registration in a single setup function (e.g., `setup_registry.py`):

```
from pathlib import Path
from mosaicolabs.ros_bridge import ROSTypeRegistry, Stores

def initialize_project_schemas():
    # 1. Register a proprietary message valid for all ROS versions
    ROSTypeRegistry.register(
        msg_type="common_msgs/msg/SystemHeartbeat",
        source=Path("./definitions/Heartbeat.msg")
    )

    # 2. Batch register an entire package for ROS 2 Humble
    ROSTypeRegistry.register_directory(
        package_name="robot_v3_msgs",
        dir_path=Path("./definitions/robot_v3/msgs"),
        store=Stores.ROS2_HUMBLE
    )
```

Once registered, the `RosbagInjector` (and the underlying `ROSLoader`) automatically detects and uses these definitions. There is no longer any need to pass the `custom_msgs` list in the `ROSInjectionConfig`.

```
# main_injection.py
import setup_registry  # Runs the registration logic above
from pathlib import Path

from mosaicolabs.ros_bridge import RosbagInjector, ROSInjectionConfig, Stores

# Initialize registry
setup_registry.initialize_project_schemas()

# Configure injection WITHOUT listing custom messages again
config = ROSInjectionConfig(
    file_path=Path("mission_data.mcap"),
    sequence_name="mission_01",
    metadata={"operator": "Alice"},
    ros_distro=Stores.ROS2_HUMBLE,  # Loader will pull the Humble-specific types we registered
    # custom_msgs=[]  <-- No longer needed!
)

injector = RosbagInjector(config)
injector.run()
```

### Testing & Validation¶

The ROS Bag Injection module has been validated against a variety of standard datasets to ensure compatibility with different ROS distributions, message serialization formats (CDR/ROS 1), and bag container formats (`.bag`, `.mcap`, `.db3`).

#### Recommended Dataset for Verification¶

For evaluating Mosaico's capabilities, we recommend the **NVIDIA NGC Catalog - R2B Dataset 2024**. This dataset has been verified to be fully compatible with the injection pipeline.

The following table details the injection performance for the **NVIDIA R2B Dataset 2024**. These benchmarks were captured on a system running **macOS 26.2** with an **Apple M2 Pro (10 cores, 16GB RAM)**.

#### NVIDIA R2B Dataset 2024 Injection Performance¶

| Sequence Name | Compression Factor | Injection Time | Hardware Architecture | Notes |
| --- | --- | --- | --- | --- |
| **`r2b_galileo2`** | ~70% | ~40 sec | Apple M2 Pro (16GB) | High compression achieved for telemetry data. |
| **`r2b_galileo`** | ~1% | ~30 sec | Apple M2 Pro (16GB) | Low compression due to pre-compressed source images. |
| **`r2b_robotarm`** | ~66% | ~50 sec | Apple M2 Pro (16GB) | High efficiency for high-frequency state updates. |
| **`r2b_whitetunnel`** | ~1% | ~30 sec | Apple M2 Pro (16GB) | Low compression; contains topics with no available adapter. |

#### Understanding Performance Factors¶

* **Compression Factors**: Sequences like `r2b_galileo2` achieve high ratios (~70%) because Mosaico optimizes the underlying columnar storage for scalar telemetry. Conversely, sequences with pre-compressed video feeds show minimal gains (~1%) because the data is already in a dense format.
* **Injection Time**: This metric includes the overhead of local MCAP/DB3 deserialization via `ROSLoader`, semantic translation through the `ROSBridge`, and the asynchronous transmission to the Mosaico server.
* **Hardware Impact**: On the **Apple M2 Pro**, the `RosbagInjector` utilizes multi-threading for the **Adaptation Layer**, allowing serialization tasks to run in parallel while the main thread manages the Flight stream.

#### Known Issues & Limitations¶

While the underlying `rosbags` library supports the majority of standard ROS 2 bag files, specific datasets with non-standard serialization alignment or proprietary encodings may encounter compatibility issues.

**NVIDIA Isaac ROS Benchmark Dataset (2023)**

* **Source:** NVIDIA NGC Catalog - R2B Dataset 2023
* **Issue:** Deserialization failure during ingestion.
* **Technical Details:** The ingestion process fails within the `AnyReader.deserialize` method of the `rosbags` library. The internal CDR deserializer triggers an assertion error indicating a mismatch between the expected data length and the raw payload size.
* **Error Signature:**

```
# In rosbags.serde.cdr:
assert pos + 4 + 3 >= len(rawdata)
```

* **Recommendation:** This issue originates in the upstream parser's handling of this dataset's serialization alignment. It is currently recommended to exclude this dataset or transcode it using standard ROS 2 tools before ingestion.

## Supported Message Types¶

***ROS-Specific Data Models***

In addition to mapping standard ROS messages to the core Mosaico ontology, the `ros-bridge` module implements two specialized data models. These are defined specifically for this module to handle ROS-native concepts that are not yet part of the official Mosaico standard:

* **`FrameTransform`**: Designed to handle coordinate frame transformations (modeled after `tf2_msgs/msg/TFMessage`). It encapsulates a list of `Transform` objects to manage spatial relationships.
* **`BatteryState`**: Modeled after `sensor_msgs/msg/BatteryState`, this class captures comprehensive power supply metrics.
It includes core data (voltage, current, capacity, percentage) and detailed metadata such as power supply health, technology status, and individual cell readings.

> **Note:** Although these are provisional additions, both `FrameTransform` and `BatteryState` inherit from `Serializable` and `HeaderMixin`. This ensures they remain fully compatible with Mosaico’s existing serialization and header management infrastructure.

### Supported Message Types Table¶

| ROS Message Type | Mosaico Ontology Type | Adapter |
| --- | --- | --- |
| `geometry_msgs/Pose`, `PoseStamped`... | `Pose` | `PoseAdapter` |
| `geometry_msgs/Twist`, `TwistStamped`... | `Velocity` | `TwistAdapter` |
| `geometry_msgs/Accel`, `AccelStamped`... | `Acceleration` | `AccelAdapter` |
| `geometry_msgs/Vector3`, `Vector3Stamped` | `Vector3d` | `Vector3Adapter` |
| `geometry_msgs/Point`, `PointStamped` | `Point3d` | `PointAdapter` |
| `geometry_msgs/Quaternion`, `QuaternionStamped` | `Quaternion` | `QuaternionAdapter` |
| `geometry_msgs/Transform`, `TransformStamped` | `Transform` | `TransformAdapter` |
| `geometry_msgs/Wrench`, `WrenchStamped` | `ForceTorque` | `WrenchAdapter` |
| `nav_msgs/Odometry` | `MotionState` | `OdometryAdapter` |
| `nmea_msgs/Sentence` | `NMEASentence` | `NMEASentenceAdapter` |
| `sensor_msgs/Image`, `CompressedImage` | `Image`, `CompressedImage` | `ImageAdapter`, `CompressedImageAdapter` |
| `sensor_msgs/Imu` | `IMU` | `IMUAdapter` |
| `sensor_msgs/NavSatFix` | `GPS`, `GPSStatus` | `GPSAdapter`, `NavSatStatusAdapter` |
| `sensor_msgs/CameraInfo` | `CameraInfo` | `CameraInfoAdapter` |
| `sensor_msgs/RegionOfInterest` | `ROI` | `ROIAdapter` |
| `sensor_msgs/JointState` | `RobotJoint` | `RobotJointAdapter` |
| `sensor_msgs/BatteryState` | `BatteryState` (ROS-specific) | `BatteryStateAdapter` |
| `std_msgs/msg/String` | `String` | `_GenericStdAdapter` |
| `std_msgs/msg/Int8(16,32,64)` | `Integer8(16,32,64)` | `_GenericStdAdapter` |
| `std_msgs/msg/UInt8(16,32,64)` | `Unsigned8(16,32,64)` | `_GenericStdAdapter` |
| `std_msgs/msg/Float32(64)` | `Floating32(64)` | `_GenericStdAdapter` |
| `std_msgs/msg/Bool` | `Boolean` | `_GenericStdAdapter` |
| `tf2_msgs/msg/TFMessage` | `FrameTransform` (ROS-specific) | `FrameTransformAdapter` |

---

The **Mosaico Daemon**, a.k.a. `mosaicod`, acts as the engine of the data platform. Developed in **Rust**, it is engineered to be the high-performance arbiter for all data interactions, guaranteeing that every byte of robotics data is strictly typed, atomically stored, and efficiently retrievable. It functions on a standard client-server model, mediating between your high-level applications (via the SDKs) and the low-level storage infrastructure.

## Architectural Design¶

`mosaicod` is architected atop the Apache Arrow Flight protocol. Apache Arrow Flight is a general-purpose, high-performance client-server framework developed for the exchange of massive datasets. It operates directly on Apache Arrow columnar data, enabling efficient transport over gRPC without the overhead of serialization. Unlike traditional REST APIs, which serialize data into text-based JSON, Flight is designed specifically for high-throughput data systems.

This architectural choice provides Mosaico with three critical advantages:

**Zero-Copy Serialization.** Data is transmitted in the Arrow columnar format, the exact same format used in-memory by modern analytics tools like pandas and Polars. This eliminates the CPU-heavy cost of serializing and deserializing data at every hop.

**Parallelized Transport.** Operations are not bound to a single pipe; data transfer can be striped across multiple connections to saturate available bandwidth.

**Snapshot-Based Schema Enforcement.** Data types are not guessed, nor are they forced into a rigid global model. Instead, the protocol enforces a rigorous schema handshake that validates data against a specific schema snapshot stored with the sequence.
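The third advantage is easiest to see in miniature. The toy sketch below, plain Python with illustrative data structures rather than `mosaicod` internals, shows the core rule of snapshot-based enforcement: a sequence pins the schema it was created with, and every later write must match that snapshot exactly.

```python
# Toy illustration of snapshot-based schema enforcement (NOT mosaicod
# internals): each sequence pins the schema captured when it was created,
# and every incoming batch is validated against that exact snapshot.

snapshots: dict[str, dict[str, str]] = {}  # sequence name -> pinned schema

def create_sequence(name: str, schema: dict[str, str]) -> None:
    """Freeze the schema snapshot at sequence-creation time."""
    snapshots[name] = dict(schema)

def validate_batch(name: str, batch_schema: dict[str, str]) -> None:
    """Reject any write whose schema drifts from the pinned snapshot."""
    pinned = snapshots[name]
    if dict(batch_schema) != pinned:
        raise ValueError(f"schema mismatch for '{name}': {batch_schema} != {pinned}")

create_sequence("run_2023_01", {"timestamp": "int64", "value": "float64"})
validate_batch("run_2023_01", {"timestamp": "int64", "value": "float64"})  # accepted

try:
    validate_batch("run_2023_01", {"timestamp": "int64", "value": "utf8"})
except ValueError as err:
    print(f"rejected: {err}")
```

In the real protocol the "schema" is an Arrow schema exchanged at the start of a `do_put` stream, but the invariant is the same: the snapshot, not the client, is the source of truth.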
### Resource Addressing¶

Mosaico treats every entity in the system, whether it's a Sequence or a Topic, as a uniquely addressable resource. These resources are identified by a **Resource Locator**, a uniform logical path that remains consistent across all channels. Mosaico uses two types of resource locators:

* A **Sequence Locator** identifies a recording session by its sequence name (e.g., `run_2023_01`).
* A **Topic Locator** identifies a specific data stream using a hierarchical path that includes the sequence name and topic path (e.g., `run_2023_01/sensors/lidar_front`).

### Flight Endpoints¶

The daemon exposes Apache Arrow Flight endpoints that handle various operations using Flight's core methods: `list_flights` and `get_flight_info` for discovery and metadata management, `do_put` for high-speed data ingestion, and `do_get` for efficient data retrieval. This design ensures administrative operations don't interfere with data throughput while maintaining low-latency columnar data access.

### Storage Architecture¶

`mosaicod` uses a database to perform fast queries on metadata, manage system state such as sequence and topic definitions, and handle the event queue for processing asynchronous tasks like background data processing or notifications. An object store (such as S3, MinIO, or the local filesystem) provides long-term storage for resilience and durability, holding the bulk sensor data, images, point clouds, and immutable schema snapshots that define data structures.

**Database Durability and Recovery**

The database state is entirely transient and can be fully reconstructed from the object store. This also enables importing data from other stores. *Currently, there is no way to import data and reconstruct the database, but we are designing the system to enable this feature in future releases.* If the metadata database is corrupted or destroyed, `mosaicod` can rebuild the entire catalog by rescanning the durable object storage.
This design ensures that while the database provides performance, the store guarantees long-term durability and recovery, protecting your data against catastrophic infrastructure failure.

---

# Setup¶

For rapid prototyping, we provide a Docker Compose configuration. This sets up a volatile environment that includes both the Mosaico server and a PostgreSQL database.

```
# Navigate to the quick start directory from the root folder
cd docker/quick_start

# Start up the infra in the background
docker compose up -d
```

This launches PostgreSQL on port `5432` and `mosaicod` on its **default port** `6726`.

**Volatile storage**

The default Mosaico configuration uses non-persistent storage. This means that if the container is destroyed, all stored data will be lost. Since Mosaico is still under active development, we provide this simple, volatile setup by default. For persistent storage, the standard `compose.yml` file can be easily extended to utilize a Docker volume.

## Building from Source¶

To build Mosaico for production, you need a Rust toolchain. Mosaico uses `sqlx` for compile-time query verification, which typically requires a live database connection. However, we support an offline build mode using cached metadata (the `.sqlx` folder).

### Offline Build - Recommended¶

```
SQLX_OFFLINE=true cargo build --release
```

The binary will be located at `target/release/mosaicod`.

### Live Migrations¶

If you need to modify the database schema, a running PostgreSQL instance is required. This allows `sqlx` to verify queries against a live database during compilation. You can use the provided Docker Compose file in `docker/devel`, which sets up an instance of MinIO and a PostgreSQL database. First, start the development environment.
From inside the `docker/devel` directory, run:

```
# Start the services in the background
docker compose up -d

# To stop and remove the volumes (which clears all data), run:
docker compose down -v
```

Next, from the root of the `mosaicod` workspace, install the necessary tools, configure the environment, and run the build.

```
# Install the SQLx command-line tool
cargo install sqlx-cli

# Copy the development environment variables for the database connection
cp env.devel .env

# Apply the database migrations
cargo sqlx migrate run

# Finally, compile the project
cargo build --release
```

## Configuration¶

The server supports S3-compatible object storage by default but can be configured for local storage via command-line options.

### Database¶

Mosaico requires a connection to a running **PostgreSQL** instance, which is defined via the `MOSAICO_REPOSITORY_DB_URL` environment variable.

### Remote Storage Configuration¶

For production deployments, `mosaicod` should be configured to use an S3-compatible object store (such as AWS S3, Google Cloud Storage, Hetzner Object Store, etc.) for durable, long-term storage. This is configured through the following environment variables:

| Environment Variable | Description |
| --- | --- |
| `MOSAICO_STORE_BUCKET` | The name of the S3 bucket where Mosaico will store all data blobs. This bucket must be created before starting the server. |
| `MOSAICO_STORE_ENDPOINT` | The full URL endpoint for the S3-compatible service. This is necessary for non-AWS providers (e.g., `http://localhost:9000` for a local MinIO instance). |
| `MOSAICO_STORE_ACCESS_KEY` | The access key ID for authenticating with your object storage service. |
| `MOSAICO_STORE_SECRET_KEY` | The secret access key that corresponds to the provided access key ID, used for authentication. |

### Local Storage Configuration¶

This command will start a `mosaicod` instance using the local filesystem as the storage layer.
```
mosaicod run --local-store /tmp/mosaicod
```

---

# Custom Actions¶

Mosaico implements its own administrative protocols directly on top of Apache Arrow Flight. Rather than relying on a separate control-channel abstraction, Mosaico leverages the Flight `DoAction` RPC mechanism to handle discrete lifecycle events, administrative interfaces, and resource management.

Unlike streaming endpoints designed for continuous data throughput, these custom actions manage the platform's overarching state. While individual calls are synchronous, they often initiate or conclude multi-step processes, such as a topic upload, that govern the long-term integrity of data within the platform.

All custom actions follow a standardized pattern: they expect a JSON-serialized payload defining the request parameters and return a JSON-serialized response containing the result.

## Sequence Management¶

Sequences are the fundamental containers for data recordings in Mosaico. These custom actions enforce a strict lifecycle state machine to guarantee data integrity.

| Action | Description |
| --- | --- |
| `sequence_create` | Initializes a new, empty sequence. It generates and returns a unique key (UUID). This key acts as a write token, authorizing subsequent data ingestion into this specific sequence. This avoids concurrent access and creation issues when multiple clients attempt to create sequences simultaneously. |
| `sequence_finalize` | Transitions a sequence from *uploading* to *archived*. This action locks the sequence, marking it as immutable. Once finalized, no further data can be added or modified, ensuring a perfect audit trail. |
| `sequence_abort` | A cleanup operation for failed uploads. It discards a sequence that is currently being uploaded, purging any partial data from the storage to prevent *zombie* records. |
| `sequence_delete` | Permanently removes a sequence from the platform. To protect data lineage, this is typically permitted only on unlocked (incomplete) sequences. |
## Topic Management¶

Topics represent the individual sensor streams (e.g., `camera/front`, `gps`) contained within a sequence.

| Action | Description |
| --- | --- |
| `topic_create` | Registers a new topic. |
| `topic_delete` | Removes a specific topic from a sequence, permitted only if the parent sequence is still unlocked. |

## Notification System¶

The platform includes a tagging mechanism to attach alerts or informational messages to resources. For example, if an exception is raised during an upload, the notification system automatically registers the event, ensuring the failure is logged and visible for troubleshooting.

| Action | Description |
| --- | --- |
| `*_notify_create` | Attaches a notification to a Sequence or Topic, such as logging an error or status update. |
| `*_notify_list` | Retrieves the history of active notifications for a resource, allowing clients to review alerts. |
| `*_notify_purge` | Clears the notification history for a resource, useful for cleanup after resolution. |

Here, `*` can be either `sequence` or `topic`.

## Query¶

| Action | Description |
| --- | --- |
| `query` | This action serves as the gateway to the query system. It accepts a complex filter object and returns a list of resources that match the criteria. |

---

# Ingestion¶

Data ingestion in Mosaico is handled by the Flight `DoPut` streaming endpoint. This channel is explicitly engineered for write-heavy workloads, enabling the system to absorb high-bandwidth sensor data, such as 4K video streams or high-frequency Lidar point clouds, without contending with administrative traffic.

## The Ingestion Protocol¶

Data ingestion follows a structured protocol to ensure type safety and proper sequencing. The process begins with creating a new sequence using `sequence_create`, which takes a sequence name and optional user metadata, returning a unique sequence UUID.
Within this sequence, you create topics for each data stream via `topic_create`, associating them with the sequence UUID and assigning unique paths like `my_sequence/topic/1`. Each topic can also include its own metadata. For each topic, data is uploaded using the Flight `do_put` operation, starting with an Arrow schema for validation, followed by streaming `RecordBatch` payloads. Once all topics are uploaded, the sequence is finalized with `sequence_finalize`, committing it to make the data immutable and queryable.

During this process, the server validates schemas against registered ontologies, chunks data for efficient storage, and computes indices for fast querying.

Ingestion protocol in pseudo-code

```
sq_uuid = do_action("my_sequence", metadata)

# Create topic and upload data
t1_uuid = do_action(sq_uuid, "my_sequence/topic/1", metadata) # (1)!
do_put(t1_uuid, data_stream)

do_action(sq_uuid) # (2)!
```

1. The `topic_create` action returns a UUID that must be passed to the `do_put` call.
2. During finalization, all resources are consolidated and locked. Alternatively, you can call `sequence_abort(sq_uuid)`.

Why UUIDs?

UUIDs are employed in the ingestion protocol to prevent contentious uploads of the same resources. For instance, if two users attempt to create a new resource (such as a sequence or topic) with the same name, only one will succeed and receive a UUID. This UUID is then used in subsequent calls to ensure that operations are performed by the user who successfully created the resource.

## Chunking & Indexing Strategy¶

The backend automatically manages *chunking* to efficiently handle intra-sequence queries and prevent memory overload from streaming data. As data streams in, the server buffers the incoming data until a full chunk is accumulated, then writes it to disk as an optimal storage unit called a *chunk*. For each chunk written to disk, the server calculates and stores *skip indices* in the metadata database.
These indices include ontology-specific statistics, such as type-specific metadata (e.g., coordinate bounding boxes for GPS data or value ranges for sensors). This allows the query engine to perform content-based filtering without needing to read the entire bulk data.

---

# Retrieval¶

Measurement data in Mosaico is accessed through the Flight `DoGet` endpoint for high-performance read operations. Unlike simple file downloads, this channel provides an interface for requesting precise data slices, dynamically assembled and streamed back as optimized Arrow batches.

## The Retrieval Protocol¶

Accessing data requires specifying the **Locator**, which defines the topic path, and an optional time range in nanoseconds.

The resolution process follows a coordinated sequence. Upon receiving a request, the server performs an index lookup in the metadata cache to identify physical data chunks intersecting the requested time window. This is followed by pruning, discarding chunks outside the query bounds to avoid redundant I/O. Once relevant segments are identified, the server streams the data by opening the underlying files and delivering them in a high-throughput pipeline.

In the protocol, the `get_flight_info` call returns a list of resources, each containing an endpoint (the name of the topic or sequence, such as `my_sequence` or `my_sequence/my/topic`) and a ticket, an opaque binary blob used by the server in the `do_get` call to extract and stream the data. Calling `get_flight_info` on a sequence returns all topics associated with that sequence, whereas calling it on a specific topic returns only the endpoint and ticket for that topic.
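The index-lookup and pruning steps described above can be sketched with plain per-chunk min/max statistics. The `ChunkIndex` structure and `prune` function below are illustrative assumptions, not the daemon's actual index layout:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChunkIndex:
    """Illustrative skip index: per-chunk time bounds plus a value range."""
    path: str
    t_min: int    # nanoseconds
    t_max: int
    v_min: float
    v_max: float

def prune(chunks, t_start, t_end, v_threshold: Optional[float] = None):
    """Keep only chunks that can possibly satisfy the query."""
    hits = []
    for c in chunks:
        if c.t_max < t_start or c.t_min > t_end:
            continue  # entirely outside the requested time window
        if v_threshold is not None and c.v_max <= v_threshold:
            continue  # a value filter (e.g. $gt) can never match in this chunk
        hits.append(c)
    return hits

chunks = [
    ChunkIndex("chunk_000", 0, 999, 0.0, 1.0),
    ChunkIndex("chunk_001", 1000, 1999, 0.0, 9.0),
    ChunkIndex("chunk_002", 2000, 2999, 0.0, 2.0),
]

# Time window [1500, 2500] with a "$gt": 3.0 value filter:
# only chunk_001 overlaps the window AND can contain values above 3.0.
assert [c.path for c in prune(chunks, 1500, 2500, 3.0)] == ["chunk_001"]
```

The point of the sketch is that only index metadata is inspected; the bulk chunk files are opened solely for the survivors of the pruning pass.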
Retrieval protocol in pseudo-code

```
locator = "my_sequence/topic/1"
time_range = (start_ns, end_ns)  # optional

resources = get_flight_info(locator, time_range)
for res in resources:
    print(res.endpoint)
    data_stream = do_get(res.ticket)
```

## Metadata Context Headers¶

To provide full context, the data stream is prefixed with a Schema message containing embedded custom metadata. Mosaico injects context into this header for client-side reconstruction of the environment. This includes *user metadata*, preserving original project context like experimental tags or vehicle IDs, and the *ontology tag*, informing the client of the sensor data type (e.g., `Lidar`, `Camera`) for type-safe deserialization. The *serialization format* guides interpretation of the underlying serialization protocol used. The currently supported formats are:

* `Default`: The standard Arrow columnar layout.
* `Ragged`: Optimized for variable-length lists.
* `Image`: An optimized array format for high-resolution visual data.

---

# Queries¶

Mosaico distinguishes itself from simple file stores with a powerful **Query System** capable of filtering data based on both high-level metadata and content values. The query engine operates through the `query` action, accepting structured JSON-based filter expressions that can span the entire data hierarchy.

## Query Architecture¶

The query engine is designed around a three-tier filtering model that allows you to construct complex, multi-dimensional searches:

**Sequence Filtering.** Target recordings by structural attributes like sequence name, creation timestamp, or user-defined metadata tags. This level allows you to narrow down which recording sessions are relevant to your search.

**Topic Filtering.** Refine your search to specific data streams within sequences. You can filter by topic name, ontology tag (the data type), serialization format, or topic-level user metadata.
**Ontology Filtering.** Query the actual physical values recorded inside the sensor data without scanning terabytes of files. The engine leverages statistical indices computed during ingestion, min/max bounds stored in the metadata cache for each chunk, to rapidly include or exclude entire segments of data.

## Filter Domains¶

### Sequence Filter¶

The sequence filter allows you to target specific recording sessions based on their metadata:

| Field | Description |
| --- | --- |
| `sequence.name` | The sequence identifier (supports text operations) |
| `sequence.creation` | The creation timestamp in nanoseconds (supports timestamp operations) |
| `sequence.user_metadata.<key>` | Custom user-defined metadata attached to the sequence |

### Topic Filter¶

The topic filter narrows the search to specific data streams within matching sequences:

| Field | Description |
| --- | --- |
| `topic.name` | The topic path within the sequence (supports text operations) |
| `topic.creation` | The topic creation timestamp in nanoseconds (supports timestamp operations) |
| `topic.ontology_tag` | The data type identifier (e.g., `Lidar`, `Camera`, `IMU`) |
| `topic.serialization_format` | The binary layout format (`Default`, `Ragged`, or `Image`) |
| `topic.user_metadata.<key>` | Custom user-defined metadata attached to the topic |

### Ontology Filter¶

The ontology filter queries the actual sensor data values. Fields are specified using dot notation: `<ontology_tag>.<field_path>`. For example, to query IMU acceleration data: `imu.acceleration.x`, where `imu` is the ontology tag and `acceleration.x` is the field path within that data model.

#### Timestamp query support¶

If `include_timestamp_range` is set to `true`, the response will also return the timestamp ranges for each query.

## Supported Operators¶

The query engine supports a rich set of comparison operators.
Each operator is prefixed with `$` in the JSON syntax:

| Operator | Description |
| --- | --- |
| `$eq` | Equal to (supports all types) |
| `$neq` | Not equal to (supports all types) |
| `$lt` | Less than (numeric and timestamp only) |
| `$gt` | Greater than (numeric and timestamp only) |
| `$leq` | Less than or equal to (numeric and timestamp only) |
| `$geq` | Greater than or equal to (numeric and timestamp only) |
| `$between` | Within a range `[min, max]` inclusive (numeric and timestamp only) |
| `$in` | Value is in a set of options (supports integers and text) |
| `$match` | Matches a pattern (text only, supports SQL LIKE patterns with `%` wildcards) |
| `$ex` | Field exists |
| `$nex` | Field does not exist |

## Query Syntax¶

Queries are submitted as JSON objects. Each field is mapped to an operator and value. Multiple conditions are combined with implicit AND logic.

```
{
    "sequence": {
        "name": { "$match": "test_run_%" },
        "user_metadata": { "driver": { "$eq": "Alice" } }
    },
    "topic": {
        "ontology_tag": { "$eq": "imu" }
    },
    "ontology": {
        "imu.acceleration.x": { "$gt": 5.0 },
        "imu.acceleration.y": { "$between": [-2.0, 2.0] },
        "include_timestamp_range": true // (1)!
    }
}
```

1. This field is optional; if set to `true`, the query also returns the timestamp ranges.

This query searches for:

* Sequences with names matching the `test_run_%` pattern
* Where the user metadata field `driver` equals `"Alice"`
* Containing topics with ontology tag `imu`
* Where the IMU's x-axis acceleration exceeds 5.0
* And the y-axis acceleration is between -2.0 and 2.0

## Response Structure¶

The query response is hierarchically grouped by sequence. For each matching sequence, it provides the list of topics that satisfied the filter criteria, along with optional timestamp ranges indicating when the ontology conditions were met.
Query response example

```
{
  "items": [
    {
      "sequence": "test_run_01",
      "topics": [
        { "locator": "test_run_01/sensors/imu", "timestamp_range": [1000000000, 2000000000] },
        { "locator": "test_run_01/sensors/gps", "timestamp_range": [1000000000, 2000000000] }
      ]
    },
    {
      "sequence": "test_run_02",
      "topics": [
        { "locator": "test_run_02/camera/front", "timestamp_range": [1500000000, 2500000000] },
        { "locator": "test_run_02/lidar/point_cloud", "timestamp_range": [1500000000, 2500000000] }
      ]
    }
  ]
}
```

### Timestamps¶

The `timestamp_range` field reports the time window `[min, max]` in which the filter conditions were met for that topic, with `min` being the timestamp of the first matching event and `max` being the timestamp of the last. This allows you to retrieve only the relevant data slices using the retrieval protocol.

Note

The `timestamp_range` field is included only when ontology filters are applied and `include_timestamp_range` is set to `true` inside the `ontology` filter.

## Performance Characteristics¶

The query engine is optimized for high performance by minimizing unnecessary data retrieval and I/O operations. During execution, the engine uses index-based pruning to evaluate precomputed min/max statistics and skip indices, allowing it to bypass irrelevant data chunks without reading the underlying files. Performance is further improved by executing metadata cache queries, such as sequence and topic filters, directly within the database, which ensures sub-second response times even across thousands of sequences. The system employs **lazy evaluation** to keep network payloads lightweight; instead of returning raw data immediately, queries return locators and timestamp ranges. This architecture allows client applications to fetch only the required data slices via the retrieval protocol as needed.
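As a minimal sketch of the lazy-evaluation pattern, the response can be flattened into `(locator, start_ns, end_ns)` tuples to drive the subsequent retrieval step. The variable names here are illustrative, not part of the Mosaico API; the response layout follows the example above.

```python
# A response shaped like the example above (truncated to one topic).
response = {
    "items": [
        {
            "sequence": "test_run_01",
            "topics": [
                {
                    "locator": "test_run_01/sensors/imu",
                    "timestamp_range": [1000000000, 2000000000],
                },
            ],
        },
    ],
}

# Flatten the hierarchical response into per-topic retrieval tasks.
# Each tuple names a topic and the nanosecond window worth fetching.
slices = [
    (topic["locator"], *topic["timestamp_range"])
    for item in response["items"]
    for topic in item["topics"]
]
```

Each tuple in `slices` can then be handed to the retrieval protocol so that only the matching time window of each topic is downloaded, rather than the full recording.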
---

# CLI Reference¶

## Run¶

Start the server locally with verbose logging:

```
mosaicod run [OPTIONS]
```

| Option | Default | Description |
| --- | --- | --- |
| `--host` | `false` | Listen on all addresses, including LAN and public addresses. |
| `--port <port>` | `6726` | Port to listen on. |
| `--local-store <path>` | `None` | Enable storage of objects on the local filesystem at the specified directory path. |

---

# Release Cycle¶

This document briefly describes how the release process is handled for the Mosaico project. We use semantic versioning (`v<major>.<minor>.<patch>`) to label every new official release.

## Development process¶

The basic idea is to use more than one development branch, allowing several versions to progress simultaneously. Below, we introduce the terminology of branches and tags involved in the process:

* `main`: this is the only stable branch, where every commit is an official release. Critical patches to the latest version are merged directly on this branch.
* `release/x.y.0`: this is the catch-all branch for the version `x.y.0`. Once ready, it is merged back into `main` and deleted.
* `issue/<number>/x.y.z`: this kind of branch is associated with the corresponding GitHub issue `#<number>`. It can contain the development of a new feature or a bug fix. It is a child of the corresponding `release/x.y.z` branch and is merged back into it when completed.
* `hotfix/x.y.<patch>`: this branch is intended to contain critical fixes, documentation updates, and maintenance tasks. It is derived directly from `main` and merged back into it to produce the new official version `x.y.<patch>`.
* `vx.y.z`: this tag is created when a new stable version is ready.

Let's look at an example.

## Maintenance process¶

Maintenance of older major versions (LTS) follows a slightly different process. We add to the terminology the following branch:

* `lts/x`: this branch is created from the last official release of version `x` present in `main` and lives until the end of support. Only fixes are permitted, using `issue/` branches.
Once a new version is ready, it is tagged by incrementing only the patch version (`vx.y.<patch>`).

---
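The branch-naming conventions above can be expressed as regular expressions. The following sketch is our reading of the documented scheme, not an official Mosaico tool; it may be useful in CI to reject branches that do not follow the convention.

```python
import re

# One pattern per documented branch kind. The numeric groups stand in for
# the <number>, x, y, z, and <patch> placeholders used in the text.
BRANCH_PATTERNS = {
    "main": r"^main$",
    "release": r"^release/\d+\.\d+\.0$",
    "issue": r"^issue/\d+/\d+\.\d+\.\d+$",
    "hotfix": r"^hotfix/\d+\.\d+\.\d+$",
    "lts": r"^lts/\d+$",
}


def classify_branch(name: str):
    """Return the branch kind for a name, or None if it matches no convention."""
    for kind, pattern in BRANCH_PATTERNS.items():
        if re.match(pattern, name):
            return kind
    return None
```

For example, `classify_branch("issue/42/1.4.0")` recognizes an issue branch targeting release `1.4.0`, while a name like `feature/foo` falls outside the convention and yields `None`.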