A GSoC project to automatically partition and parallelize hardware simulations in Verilator using MPI.

As modern SoC designs get more and more complex, especially manycore-based ones, simulation performance becomes a serious bottleneck. RTL simulation is still the most accurate way to verify digital designs, but traditional monolithic simulators do not scale well when the design contains many replicated hardware blocks such as cores or NoC components. This often results in extremely long simulation times, which slows down the design and development cycle.

Newer simulators do offer parallel simulation, but they lack one important aspect: they fail to give the simulator (or the compiler that does the parsing and AST construction) a view of the physical structure of the hardware design. Because of this, the preprocessing, AST construction, elaboration, and optimization follow the same standard approach that a general-purpose software compiler like GCC follows. However, unlike C and C++, HDLs carry much more information than a general-purpose compiler can see. An intuitive example is gem5: when modifying structures such as the O3 CPU model, the binary may build without any errors and yet fail later during simulation. This happens for the same reason: GCC does not know what the code represents, so it treats it exactly like any other C++ code. Apart from this, current parallel simulation frameworks lack the ability to scale.

To handle the scaling issue, my mentors, Dr. Guillem and Prof. Jonathan, came up with Metro-MPI, a novel way of parallelizing RTL simulations that targets OpenPiton. Instead of a single binary simulating the whole design, Metro-MPI breaks it into smaller binaries: one simulating the top-level system and one for each partitioned/duplicated hardware block. The partitioning keeps hardware boundaries in mind so that data movement between these binaries is minimized. These binaries are then simulated in parallel, as separate processes spread across multiple cores and nodes, using MPI (Message Passing Interface), the de facto standard for communication among the processes of a parallel program running on a distributed-memory system.

We opted for this approach even though Verilator, an open-source SystemVerilog simulator, already provides a built-in partitioning and scheduling mechanism based on the 1989 paper "Partitioning and Scheduling Parallel Programs for Multiprocessors". That mechanism is too generic; we can do better by making the partitioner and scheduler aware of the hardware structure.

In this project, Metro-MPI++, my goal was to take the same philosophy as Metro-MPI and enable it automatically inside Verilator.

To implement this idea in Verilator, we first need to understand how these RTL simulators work and what steps are involved from start to end. After a careful study, the best place to implement this turned out to be right after AST construction is complete and before the elaboration step, since at that point the AST already contains all the information about every module in the design. As a result, the first approach we tried was to analyze the XML file that Verilator outputs, which describes the entire design.
We found that this was sufficient for our work, since the XML file is generated from the AST.

To explain further, let's take the example of an OpenPiton 2x2 configuration.

Before implementing the main partitioning and MPI integration features, the first critical step was to upgrade OpenPiton's (the world's first open-source, general-purpose, multithreaded manycore processor, here with a 64-bit Ariane RISC-V core) support from Verilator v4 to v5. The original framework relied on Verilator v4.x, but newer versions such as v5.038 are already available, and upgrading was essential for long-term maintainability and compatibility.

This upgrade introduced several challenges due to major internal changes in Verilator between v4.x and v5.x:

Common Issues in All Versions of Verilator v5.x:

v5.x Initialization Sequence: v4.x was consistent with SystemVerilog, i.e., initial blocks would run before DPI calls into the simulation, but in v5.x the scheduler was rewritten, and DPI-C calls from the host side can be scheduled before the initial blocks in the design have executed. This means b_open() or similar setup code in an initial block might not have run yet when write_64b_call() or read_64b_call() is first called, so the call may access a memory address before it has been initialized, resulting in a segmentation fault. The fix is to add a check inside the write_64b_call() and read_64b_call() functions that initializes the root/memory with a 0 value if it has not been initialized yet.

Issues with Particular Versions:

Negative Values: This error is most probably caused by v5.x being stricter and performing more standards-compliant error checking. No signal in the design should ever take a negative value; if it can, it is better to add padding so that the value is clipped to 0. [Issue]

The goal of Metro-MPI++ is to automatically analyze a Verilog design to identify parallelizable sections and map their communication pathways. When Verilator is done constructing the Abstract Syntax Tree (AST), we execute the metro_mpi() function, which runs the Metro-MPI++ logic on the constructed AST. The metro_mpi() function executes in stages, first entering V3Metro_MPI.h.

The first and most critical step is to identify which parts of the hardware design are suitable for partitioning and parallel simulation. The framework employs a heuristic-based approach that identifies structurally identical, repeated module instances within the design hierarchy. This process is managed by the HierCellsGraphVisitor class.

The detection algorithm operates as follows:

Structural Hashing: To identify structurally identical sub-hierarchies, a unique hash is generated for each node. This hash is not based on the instance name (e.g., $root.core_0) but on the hierarchical path of module types (e.g., $root.Top.Core). The system uses the blake2b algorithm for this purpose. This ensures that two instances core_0 and core_1, both of type Core under a Top module, produce the same hash even though their instance paths differ.

To compute these hashes, we first perform a DFS traversal. When we reach a leaf node in the AST, we calculate the hash of its module name (not its instance name). We repeat this for all nodes at the same level, generating a 128-bit hash for each name, since blake2b takes variable-size input and produces a fixed-length hash. Then, as the DFS returns to a parent node, we compute the parent's hash by applying the hash function to its own module name combined with the hashes of its children, again yielding a fixed-length hash. By choosing blake2b we get a consistent hash for every node, and we can be certain that if any two nodes have the same hash, the hierarchies below them are exactly the same; in other words, they represent duplicate hardware blocks.
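As a rough sketch of this bottom-up (Merkle-style) hashing, here is a minimal C++ version, assuming a simplified Cell structure for the hierarchy. It is not the actual HierCellsGraphVisitor code, and it substitutes std::hash for the 128-bit blake2b digest used in the real implementation.

```cpp
// Minimal sketch of the bottom-up structural hashing idea (illustrative only;
// the real tool uses blake2b with a fixed-length digest, std::hash is used
// here just to keep the example short).
#include <functional>
#include <string>
#include <vector>

struct Cell {
    std::string moduleName;      // module type, e.g. "Core" (not "core_0")
    std::vector<Cell> children;  // sub-instances in the hierarchy
};

// Post-order (DFS) traversal: a node's digest depends only on its module type
// and on the digests of its children, so instances of identical
// sub-hierarchies deliberately produce the same digest.
std::string structuralHash(const Cell& cell) {
    std::string material = cell.moduleName;
    for (const Cell& child : cell.children)
        material += "|" + structuralHash(child);  // combine child digests
    // Stand-in for blake2b(material) -> fixed-length digest.
    return std::to_string(std::hash<std::string>{}(material));
}
```

With such a scheme, two instances like core_0 and core_1 end up with identical hashes because they instantiate the same module types all the way down, regardless of how the instances themselves are named.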
Partition Selection (BFS): With the weighted graph built, a Breadth-First Search (BFS) is used to traverse the hierarchy level by level. At each level, the algorithm groups instances by their structural hash and uses the weights to judge which group of identical instances is the most profitable to partition. Once this "best" group is identified, the algorithm designates their common module type as the partition module and outputs the list of instance names to be analyzed further.

The creation of a functional MPI communication fabric depends critically on a detailed understanding of the design's data flow. Before MPI structures can be generated to pass information between processes, the system must know precisely which port on a given instance connects to which peers. Therefore, after identifying the partition instances, the PartitionPortAnalyzer class is invoked to perform an exhaustive analysis of the parent module's netlist. This step extracts the port-to-peer connectivity data required to construct the MPI layer that bridges the parallel processes.

In this analysis, we also perform an optimization to reduce data movement between MPI ranks at runtime. For example, as the image above shows, both tiles are connected via wires defined in the chip module. By default, data would therefore flow tile0 → chip → tile1 (and vice versa), since the wires are part of the chip module. However, when two partitioned instances communicate directly with each other, their messages do not need to pass through the chip module at all.

To avoid this, the analysis recursively looks into the connections of each port of each module instance and classifies them:

More details of this analysis are given below:

To ensure a functional and optimized parallel simulation, the framework must guarantee that every communicating process has a unique identifier and that the results of the connectivity analysis are captured in a clear, comprehensive, and usable format.

A fundamental requirement for any MPI-based application is that each parallel process has a unique integer identifier, known as its rank. The PartitionPortAnalyzer class establishes a globally unique and consistent ranking system before the main analysis begins. The assignment process is as follows (a small sketch follows this list):

System Rank: A special conceptual process named "system" is always assigned rank 0. This rank represents all non-partitioned logic, the top-level testbench, and any I/O external to the partitioned instances.

Deterministic Sorting: To ensure that the analysis is repeatable and stable, the list of discovered partition instance names is sorted alphabetically. This critical step prevents rank assignments from changing between different runs of the tool, which is essential for consistent builds.

Sequential Assignment: After sorting, the framework iterates through the list of partition instances and assigns them sequential, incremental ranks starting from 1 (e.g., 1, 2, 3, …).

Centralized Mapping: These assignments are stored in a map (m_mpiRankMap), which serves as the directory for retrieving the rank of any partition instance, or of the system process, during the analysis. The final rank for each port and its communication partners is stored directly within the Port and CommunicationPartner data structures.
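Here is a compact sketch of this deterministic assignment, assuming a plain std::map standing in for m_mpiRankMap; the helper name assignRanks is hypothetical, not the actual PartitionPortAnalyzer API.

```cpp
// Deterministic rank assignment: "system" is rank 0, partition instances get
// sequential ranks after an alphabetical sort (illustrative sketch only).
#include <algorithm>
#include <map>
#include <string>
#include <vector>

std::map<std::string, int> assignRanks(std::vector<std::string> instances) {
    std::map<std::string, int> rankMap;             // plays the role of m_mpiRankMap
    rankMap["system"] = 0;                          // non-partitioned logic / testbench
    std::sort(instances.begin(), instances.end());  // alphabetical -> stable across runs
    int rank = 1;
    for (const std::string& inst : instances)       // e.g. "tile0", "tile1", ...
        rankMap[inst] = rank++;                     // sequential ranks 1, 2, 3, ...
    return rankMap;
}
```

With the tile0/tile1 instance names from the 2x2 example, sorting gives tile0 rank 1 and tile1 rank 2, while "system" keeps rank 0 on every run.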
We introduce MPI ranks here, even though a rank is normally a runtime assignment/property, because we control the rank assignment and can therefore correlate any identifier of a partition with its rank. More importantly, it makes the generation of the MPI structures and the MPI send and receive functions very straightforward.

The framework generates two distinct reports from the analysis data, one tailored for human review and the other for machine consumption by downstream automation tools.

Machine-Readable JSON Report (writeJsonReport)

Each port object in the JSON is comprehensive, containing fields for:

After the connectivity analysis is complete and the partition_report.json is generated, the Metro-MPI framework transitions from analysis to code generation. The MPIFileGenerator class, as referenced in HierCellsGraphVisitor::findAndPrintPartitionPorts, orchestrates a multi-step process to modify the original Verilog source code. This rewriting is essential to intercept communication to and from the partitioned modules and redirect it through the MPI-enabled C++ simulation environment. The process involves creating new Verilog modules and modifying existing ones to integrate the necessary simulation hooks.

The bridge between the Verilog simulation domain and the C++ MPI domain is the SystemVerilog Direct Programming Interface (DPI). The framework automatically generates a C++ header file (metro_mpi_dpi.h) containing DPI "import" function declarations. These functions act as stubs that the Verilog code can call. The generation process, managed within the MPIFileGenerator's logic, proceeds as follows:

The resulting metro_mpi_dpi.h file contains a list of C-style function prototypes that will be implemented in a separate C++ file (metro_mpi.cpp, generated by MPICodeGenerator). This header is included by the Verilog simulator, making these C++ functions visible and callable from the hardware design. (A hypothetical excerpt of such a header is shown at the end of this section.)

The original partitioned module (.v) is not modified. Instead, for each instance identified for partitioning (e.g., core_0, core_1), the framework generates a new, instance-specific Verilog wrapper module. For an instance core_0 of module Core, a new file named metro_mpi_wrapper_core_0.v is created. This wrapper module serves as a crucial intermediary:

This strategy cleverly isolates the original design logic. The Core module remains untouched, while the wrapper provides the necessary "hooks" to interface with the external MPI simulation driver.

The final step in the Verilog rewriting process is to modify the parent module, i.e., the module that originally contained the partitioned instances. This is the most invasive step, but it is performed programmatically to ensure correctness. The MPIFileGenerator performs the following actions:

By replacing the original instances with their MPI-enabled wrappers, the framework effectively severs the direct wire-level connections between the partitions and the rest of the design. All communication is now forced to flow through the DPI interface, into the C++ domain where it can be managed by the MPI runtime, and back into the Verilog simulation, enabling true parallel execution of the partitioned hardware blocks.
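To give a sense of the interface's shape, here is a hypothetical excerpt of what a generated metro_mpi_dpi.h could look like. The real generator derives the function names and argument lists from partition_report.json, so the tile0/NoC port names below are purely illustrative, not the tool's actual output.

```cpp
// Hypothetical excerpt of a generated metro_mpi_dpi.h (illustrative only).
#ifndef METRO_MPI_DPI_H_
#define METRO_MPI_DPI_H_

#include "svdpi.h"  // standard SystemVerilog DPI types (svBitVecVal, ...)

extern "C" {
// Called from the tile0 wrapper to push an output port's value to its peer rank.
void metro_mpi_send_tile0_noc_out(const svBitVecVal* data, int width_bits);

// Called from the tile0 wrapper to fetch the next value for an input port.
void metro_mpi_recv_tile0_noc_in(svBitVecVal* data, int width_bits);
}

#endif  // METRO_MPI_DPI_H_
```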
After the Verilog sources have been rewritten, the Metro-MPI framework proceeds to generate all the necessary C++ source code to build the final, distributed simulation executables. The generation process is orchestrated by the HierCellsGraphVisitor class, which calls a series of specialized generator classes that use partition_report.json as their primary input.

The core of the communication system is a C++ layer that implements the DPI functions declared in metro_mpi_dpi.h. This layer translates the function calls made from the Verilog domain into actual MPI network operations.

The MPICodeGenerator class is responsible for creating this file (metro_mpi.cpp). It reads the JSON report to understand the complete communication graph, i.e., which port on which instance communicates with whom. For each send and receive DPI stub, it generates a corresponding C++ function body that:

This generated C++ file effectively serves as the middleware that bridges the Verilated hardware model with the MPI runtime system.
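As a rough illustration (again, not the tool's actual output), a generated send/receive pair for a single port might look like the sketch below. The peer rank, the message tag, and the port names are assumptions standing in for the values that the generator takes from the connectivity report.

```cpp
// Sketch of the kind of function bodies MPICodeGenerator could emit into
// metro_mpi.cpp (illustrative only; names, ranks, and tags are assumed).
#include <mpi.h>
#include "svdpi.h"
#include "metro_mpi_dpi.h"

// Destination rank and message tag for this particular port, as recorded in
// partition_report.json during the connectivity analysis (assumed values).
static const int kPeerRank  = 2;    // e.g. the rank assigned to tile1
static const int kNocOutTag = 100;  // per-port tag to keep messages apart

extern "C" void metro_mpi_send_tile0_noc_out(const svBitVecVal* data,
                                             int width_bits) {
    // Each svBitVecVal word carries 32 bits of the packed Verilog vector.
    const int words = (width_bits + 31) / 32;
    MPI_Send(data, words, MPI_UINT32_T, kPeerRank, kNocOutTag, MPI_COMM_WORLD);
}

extern "C" void metro_mpi_recv_tile0_noc_in(svBitVecVal* data,
                                            int width_bits) {
    const int words = (width_bits + 31) / 32;
    MPI_Recv(data, words, MPI_UINT32_T, kPeerRank, kNocOutTag, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
}
```

Because the destination rank and tag are baked in at generation time from the report, the Verilog side never needs to know anything about MPI; the wrappers simply call the DPI stubs.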
Each partitioned module must be compiled into its own standalone executable that will run on the MPI ranks greater than zero (1, 2, 3, etc.). The MPIMainGenerator class is responsible for generating the main C++ driver file for these executables. Its generate method takes the original partition module name as an argument (e.g., Core) and creates a C++ source file containing a main() function that performs the following steps:

The entire distributed simulation is controlled by a master process running on Rank 0. This process runs the top-level testbench, drives primary inputs like clock and reset, and manages the overall simulation time. The Rank0MainGenerator class is responsible for generating this master testbench harness. The code first carefully determines the correct top-level module name for the Rank 0 simulation, even accounting for Verilator's practice of wrapping the user's design in a $root module. The generate function is then called with this top module name. The resulting C++ file contains a main() function that:

A key feature of the Metro-MPI framework is its ability to automatically generate a complete build system for the complex, multi-executable MPI simulation. This process, orchestrated within the HierCellsGraphVisitor class, ensures that all components (Verilog wrappers, C++ harnesses, and communication layers) are compiled and linked correctly with the necessary MPI libraries and user-defined configurations.

The framework automates the creation of a Makefile by using a dedicated MakefileGenerator class. This component is responsible for producing a build script tailored to the specific partitioning results. The generation process is informed by a detailed analysis of the design's source file dependencies: a recursive function, collectPartitionFiles, traverses the design hierarchy of the partitioned module to gather a unique set of all .v or .sv source files it depends on. This list of files, along with the name of the top-level partition module, is passed to the makefileGenerator.generate method. This allows the generator to create specific build rules in the Makefile for compiling the partition module into its own object files, separate from the rules for compiling the main Rank 0 testbench harness.

To ensure a consistent and correct build, it is critical that any command-line options passed to the initial Verilator run (such as include directories, defines, optimization flags, or tracing options) are preserved in the final MPI build. The framework achieves this through an extern std::string argString variable, which captures the original command-line arguments. The argString is passed directly to the MakefileGenerator's generate method, and the generator incorporates these preserved arguments into the compilation commands within the Makefile, ensuring that the Verilation of both the partition and the top-level modules happens with the exact same configuration the user intended.

Compiling and linking C++ applications against an MPI library requires a special compiler wrapper, such as mpic++, which automatically adds the necessary MPI header paths and links against the required MPI libraries. The MakefileGenerator, being a core component of an MPI-centric tool, integrates this requirement seamlessly: it generates Makefile rules that invoke mpic++ (or an equivalent MPI compiler wrapper) for all C++ compilation and linking steps. By generating a Makefile that uses the appropriate MPI compiler, the framework abstracts away the complexity of locating and linking the MPI libraries, providing a simple and robust build process for the user.

The repositories I worked on:
verilator - here
Metro-MPI - here