Page

Table Of Contents


Built with 🛠 MkDocs - Theme 🖤 Github.

Hardware Acceleration Module

Overview

The Hardware Acceleration module in Neurenix provides comprehensive support for leveraging various hardware accelerators to optimize the performance of machine learning workloads. This module enables efficient execution of tensor operations, neural network training, and inference across a wide range of hardware platforms, from high-performance GPUs to specialized AI accelerators and edge devices.

Built on Neurenix's multi-language architecture, the Hardware Acceleration module implements high-performance operations in Rust and C++, with a user-friendly Python interface. This design ensures optimal performance while maintaining ease of use for researchers and developers.

The module supports multiple hardware acceleration technologies beyond basic CUDA, ROCm, and WebGPU implementations, including WebAssembly SIMD, Vulkan, OpenCL, DirectML, TensorRT, and specialized hardware like GraphCore IPUs and FPGAs. It also provides automatic device selection, graceful fallbacks, and comprehensive optimization techniques such as quantization and model pruning.

Key Concepts

Device Abstraction

The Hardware Acceleration module provides a unified device abstraction layer that enables consistent interaction with different hardware accelerators:

This abstraction allows users to write hardware-agnostic code that can run efficiently on various devices without modification.

Hardware Backends

The module supports multiple hardware backends, each optimized for specific hardware platforms:

Each backend implements the same interface, allowing seamless switching between different hardware platforms.

Automatic Optimization

The module provides automatic optimization techniques to maximize performance on different hardware:

These optimizations are applied transparently, allowing users to focus on model development rather than hardware-specific optimizations.

Distributed Execution

The module supports distributed execution across multiple devices and nodes:

This enables efficient scaling of training and inference workloads to large distributed systems.

API Reference

Device Management

import neurenix
from neurenix.hardware import get_device, set_device, device_count, is_available

# Check device availability
cuda_available = neurenix.hardware.is_available("cuda")
rocm_available = neurenix.hardware.is_available("rocm")
webgpu_available = neurenix.hardware.is_available("webgpu")

# Get device count
num_cuda_devices = neurenix.hardware.device_count("cuda")

# Get current device
current_device = neurenix.hardware.get_device()

# Set device
neurenix.hardware.set_device("cuda:0")

# Device context manager
with neurenix.hardware.device("cuda:1"):
    # Code in this block runs on CUDA device 1
    tensor = neurenix.tensor([1, 2, 3], device="cuda:1")

# Automatic device selection
neurenix.hardware.set_device("auto")  # Selects the best available device

Memory Management

from neurenix.hardware import memory_info, empty_cache, set_memory_fraction

# Get memory information
total, free, used = neurenix.hardware.memory_info("cuda:0")
print(f"Total: {total/1e9:.2f} GB, Free: {free/1e9:.2f} GB, Used: {used/1e9:.2f} GB")

# Empty cache
neurenix.hardware.empty_cache()

# Set memory fraction to use
neurenix.hardware.set_memory_fraction(0.8, "cuda:0")  # Use up to 80% of GPU memory

# Pinned memory allocation
pinned_tensor = neurenix.tensor([1, 2, 3], pin_memory=True)

# Unified memory allocation (accessible from both CPU and GPU)
unified_tensor = neurenix.tensor([1, 2, 3], memory="unified")

Stream Management

from neurenix.hardware import Stream, Event

# Create a stream
stream = neurenix.hardware.Stream(device="cuda:0")

# Execute operations on a stream
with stream:
    a = neurenix.tensor([1, 2, 3], device="cuda:0")
    b = neurenix.tensor([4, 5, 6], device="cuda:0")
    c = a + b

# Create an event
event = neurenix.hardware.Event(device="cuda:0")

# Record an event
event.record(stream)

# Wait for an event
event.wait()

# Synchronize a stream
stream.synchronize()

# Query if an event has completed
if event.query():
    print("Event completed")

Backend Configuration

from neurenix.hardware import set_backend, get_backend, available_backends

# Get available backends
backends = neurenix.hardware.available_backends()
print(f"Available backends: {backends}")

# Set backend
neurenix.hardware.set_backend("cuda")

# Get current backend
current_backend = neurenix.hardware.get_backend()

# Backend context manager
with neurenix.hardware.backend("rocm"):
    # Code in this block uses ROCm backend
    tensor = neurenix.tensor([1, 2, 3], device="rocm:0")

# Backend-specific configuration
neurenix.hardware.config.cuda.set("tensor_cores", True)
neurenix.hardware.config.rocm.set("hip_graph", True)
neurenix.hardware.config.webgpu.set("preferred_backend", "dawn")

Automatic Optimization

from neurenix.hardware import optimize, quantize, prune

# Optimize a model for specific hardware
optimized_model = neurenix.hardware.optimize(
    model=my_model,
    target_device="cuda:0",
    optimization_level=3,  # 0-3, higher means more aggressive optimization
    precision="mixed",  # "fp32", "fp16", "int8", "mixed"
    fuse_operations=True,
    enable_tensor_cores=True
)

# Quantize a model
quantized_model = neurenix.hardware.quantize(
    model=my_model,
    quantization_scheme="int8",  # "int8", "fp16", "fp8"
    calibration_dataset=calibration_data,
    per_channel=True,
    symmetric=False
)

# Prune a model
pruned_model = neurenix.hardware.prune(
    model=my_model,
    pruning_method="magnitude",  # "magnitude", "structured", "lottery_ticket"
    sparsity=0.7,  # Target sparsity level
    schedule="gradual"  # "one_shot" or "gradual"
)

Distributed Execution

from neurenix.hardware import distributed

# Initialize distributed environment
neurenix.hardware.distributed.init_process_group(
    backend="nccl",  # "nccl", "gloo", "mpi", "horovod"
    init_method="env://",
    world_size=4,
    rank=0
)

# Get distributed information
rank = neurenix.hardware.distributed.get_rank()
world_size = neurenix.hardware.distributed.get_world_size()
local_rank = neurenix.hardware.distributed.get_local_rank()

# Distributed data parallel model
model = neurenix.nn.Sequential(...)
distributed_model = neurenix.hardware.distributed.DistributedDataParallel(
    model,
    device_ids=[local_rank],
    output_device=local_rank,
    broadcast_buffers=True,
    find_unused_parameters=False
)

# All-reduce operation
tensor = neurenix.tensor([1, 2, 3], device=f"cuda:{local_rank}")
neurenix.hardware.distributed.all_reduce(tensor, op="sum")

# Barrier synchronization
neurenix.hardware.distributed.barrier()

# Cleanup
neurenix.hardware.distributed.destroy_process_group()

Hardware-Specific Features

NVIDIA Tensor Cores

from neurenix.hardware.cuda import enable_tensor_cores, is_tensor_cores_available

# Check if Tensor Cores are available
if neurenix.hardware.cuda.is_tensor_cores_available():
    # Enable Tensor Cores
    neurenix.hardware.cuda.enable_tensor_cores(True)

    # Create a model with Tensor Core optimizations
    model = neurenix.nn.Sequential(...)
    model = neurenix.hardware.cuda.optimize_for_tensor_cores(model)

WebAssembly SIMD

from neurenix.hardware.wasm import enable_simd, is_simd_available

# Check if WASM SIMD is available
if neurenix.hardware.wasm.is_simd_available():
    # Enable WASM SIMD
    neurenix.hardware.wasm.enable_simd(True)

    # Create a model optimized for WASM SIMD
    model = neurenix.nn.Sequential(...)
    model = neurenix.hardware.wasm.optimize_for_simd(model)

FPGA Acceleration

from neurenix.hardware.fpga import compile_for_fpga, load_bitstream

# Compile a model for FPGA
bitstream = neurenix.hardware.fpga.compile_for_fpga(
    model=my_model,
    target_device="xilinx_u250",
    optimization_level=3,
    precision="int8"
)

# Save bitstream
bitstream.save("/path/to/bitstream.bit")

# Load bitstream
loaded_model = neurenix.hardware.fpga.load_bitstream(
    bitstream_path="/path/to/bitstream.bit",
    device="fpga:0"
)

Framework Comparison

Neurenix Hardware vs. TensorFlow Hardware

Feature Neurenix Hardware TensorFlow Hardware
Device Abstraction Unified API for all hardware Separate APIs for different hardware
Hardware Support Comprehensive support for CUDA, ROCm, WebGPU, WASM, Vulkan, OpenCL, DirectML, FPGA, IPU, NPU Primarily focused on CUDA, ROCm, and TPU
Edge Device Support Native support with optimizations Limited through TensorFlow Lite
Quantization Automatic and manual quantization with multiple schemes Limited quantization options
Model Pruning Built-in pruning with various methods Requires separate tools
Distributed Training Integrated with multiple backends Primarily focused on TensorFlow-specific distribution
Memory Management Advanced memory management with unified memory Basic memory management
WebAssembly Support First-class WASM and WASI-NN support Limited WASM support

Neurenix Hardware provides more comprehensive hardware support compared to TensorFlow, with better integration of various acceleration technologies and more advanced optimization techniques. Its unified API makes it easier to target different hardware platforms without code changes.

Neurenix Hardware vs. PyTorch Hardware

Feature Neurenix Hardware PyTorch Hardware
Device Abstraction Unified API with automatic fallbacks Device-specific APIs
Hardware Support Comprehensive support for various hardware Primarily focused on CUDA with limited support for other hardware
Quantization Integrated quantization with multiple schemes Separate quantization toolkit
Model Pruning Built-in pruning with various methods Requires separate tools
Distributed Training Integrated with multiple backends Primarily focused on PyTorch-specific distribution
Memory Management Advanced memory management with unified memory Basic memory management
Edge Device Support Native support with optimizations Limited through separate tools
WebGPU Support First-class WebGPU support Limited WebGPU support

While PyTorch offers good CUDA support, Neurenix Hardware provides more comprehensive hardware support with better integration of various acceleration technologies. Its unified API and automatic optimizations make it easier to target different hardware platforms efficiently.

Neurenix Hardware vs. Scikit-Learn Hardware

Feature Neurenix Hardware Scikit-Learn Hardware
Device Abstraction Unified API for all hardware Limited hardware abstraction
Hardware Support Comprehensive support for various hardware Primarily CPU-focused with limited GPU support
Quantization Automatic and manual quantization Limited quantization options
Model Pruning Built-in pruning with various methods Limited pruning capabilities
Distributed Training Integrated with multiple backends Limited to joblib parallelism
Memory Management Advanced memory management Basic memory management
Edge Device Support Native support with optimizations Limited edge support
Specialized Hardware Support for FPGAs, IPUs, NPUs Limited specialized hardware support

Scikit-Learn is primarily designed for CPU execution with limited GPU support, while Neurenix Hardware provides comprehensive support for various hardware accelerators. This makes Neurenix particularly well-suited for hardware-accelerated machine learning workloads.

Best Practices

Efficient Device Utilization

  1. Choose the Right Device: Select the appropriate device for your workload.
# For large models with high computational requirements
neurenix.hardware.set_device("cuda:0")

# For deployment on edge devices
neurenix.hardware.set_device("wasm:simd")

# For cross-platform applications
neurenix.hardware.set_device("webgpu:0")

# Let Neurenix choose the best device
neurenix.hardware.set_device("auto")
  1. Minimize Host-Device Transfers: Keep data on the device as much as possible.
# Bad practice: Frequent transfers between CPU and GPU
for i in range(100):
    x = neurenix.tensor([i], device="cpu")
    x = x.to("cuda:0")
    y = x * 2
    result = y.to("cpu")

# Good practice: Keep data on the device
x = neurenix.tensor(list(range(100)), device="cuda:0")
y = x * 2
result = y.to("cpu")  # Transfer only once at the end
  1. Use Asynchronous Operations: Overlap computation and data transfer.
# Create streams for concurrent execution
stream1 = neurenix.hardware.Stream(device="cuda:0")
stream2 = neurenix.hardware.Stream(device="cuda:0")

# Execute operations concurrently
with stream1:
    a = neurenix.tensor([1, 2, 3], device="cuda:0")
    b = a * 2

with stream2:
    c = neurenix.tensor([4, 5, 6], device="cuda:0")
    d = c * 3

# Synchronize when results are needed
stream1.synchronize()
stream2.synchronize()
  1. Batch Processing: Process data in batches to amortize overhead.
# Process data in batches
batch_size = 64
for i in range(0, len(data), batch_size):
    batch = data[i:i+batch_size].to("cuda:0")
    results = model(batch)
    # Process results

Memory Management

  1. Monitor Memory Usage: Keep track of memory consumption.
# Check memory usage before and after operations
before_total, before_free, before_used = neurenix.hardware.memory_info("cuda:0")


# Perform operations
result = model(inputs)

after_total, after_free, after_used = neurenix.hardware.memory_info("cuda:0")
memory_used = after_used - before_used
print(f"Operation used {memory_used/1e6:.2f} MB")
  1. Use Memory-Efficient Techniques: Implement techniques to reduce memory usage.
# Gradient checkpointing
model = neurenix.nn.Sequential(...)
model = neurenix.hardware.enable_checkpointing(model)

# Mixed precision training
with neurenix.hardware.autocast(dtype="float16"):
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
  1. Release Memory When Not Needed: Explicitly free memory when possible.
# Delete tensors when not needed
del tensor
neurenix.hardware.empty_cache()

# Use context managers for temporary allocations
with neurenix.hardware.temp_tensors():
    # Tensors created here will be automatically freed
    temp = neurenix.zeros((10000, 10000), device="cuda:0")
    # Use temp
# temp is automatically freed here

Optimization Techniques

  1. Quantization: Reduce precision to improve performance.
# Dynamic quantization
quantized_model = neurenix.hardware.quantize(
    model=my_model,
    quantization_scheme="dynamic",
    dtype="int8"
)

# Static quantization with calibration
quantized_model = neurenix.hardware.quantize(
    model=my_model,
    quantization_scheme="static",
    dtype="int8",
    calibration_dataset=calibration_data
)

# Quantization-aware training
with neurenix.hardware.quantization_aware_training():
    for epoch in range(epochs):
        for inputs, targets in dataloader:
            outputs = model(inputs)
            loss = loss_fn(outputs, targets)
            loss.backward()
            optimizer.step()
  1. Model Pruning: Remove redundant weights to reduce model size.
# Magnitude pruning
pruned_model = neurenix.hardware.prune(
    model=my_model,
    pruning_method="magnitude",
    sparsity=0.7
)

# Structured pruning
pruned_model = neurenix.hardware.prune(
    model=my_model,
    pruning_method="structured",
    sparsity=0.5,
    structure_type="channel"
)

# Gradual pruning during training
pruner = neurenix.hardware.GradualPruner(
    model=my_model,
    pruning_method="magnitude",
    initial_sparsity=0.0,
    final_sparsity=0.8,
    start_epoch=5,
    end_epoch=25
)

for epoch in range(epochs):
    pruner.step(epoch)
    for inputs, targets in dataloader:
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)
        loss.backward()
        optimizer.step()
  1. Operator Fusion: Combine multiple operations for better performance.
# Enable operator fusion
neurenix.hardware.config.set("operator_fusion", True)

# Create a model with fused operations
model = neurenix.nn.Sequential(...)
optimized_model = neurenix.hardware.optimize(
    model=model,
    fuse_operations=True
)

Multi-Device and Distributed Training

  1. Data Parallel Training: Distribute training across multiple devices.
# Initialize distributed environment
neurenix.hardware.distributed.init_process_group(
    backend="nccl",
    init_method="env://",
    world_size=world_size,
    rank=rank
)

# Create model and move to device
model = MyModel()
model = model.to(f"cuda:{local_rank}")

# Wrap model with DistributedDataParallel
model = neurenix.hardware.distributed.DistributedDataParallel(
    model,
    device_ids=[local_rank],
    output_device=local_rank
)

# Create distributed sampler
train_sampler = neurenix.data.distributed.DistributedSampler(
    dataset=train_dataset,
    num_replicas=world_size,
    rank=rank
)

# Create data loader with distributed sampler
train_loader = neurenix.data.DataLoader(
    dataset=train_dataset,
    batch_size=batch_size,
    sampler=train_sampler,
    num_workers=4,
    pin_memory=True
)

# Training loop
for epoch in range(epochs):
    train_sampler.set_epoch(epoch)
    for inputs, targets in train_loader:
        inputs = inputs.to(f"cuda:{local_rank}")
        targets = targets.to(f"cuda:{local_rank}")

        outputs = model(inputs)
        loss = loss_fn(outputs, targets)
        loss.backward()
        optimizer.step()
  1. Model Parallel Training: Split a model across multiple devices.
# Define a model with layers on different devices
class ModelParallelModel(neurenix.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = neurenix.nn.Linear(1024, 512).to("cuda:0")
        self.layer2 = neurenix.nn.Linear(512, 256).to("cuda:1")
        self.layer3 = neurenix.nn.Linear(256, 10).to("cuda:2")

    def forward(self, x):
        x = x.to("cuda:0")
        x = self.layer1(x)
        x = x.to("cuda:1")
        x = self.layer2(x)
        x = x.to("cuda:2")
        x = self.layer3(x)
        return x

# Create model
model = ModelParallelModel()

# Training loop
for inputs, targets in dataloader:
    outputs = model(inputs)
    targets = targets.to("cuda:2")  # Move targets to the device of the output
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()
  1. Pipeline Parallel Training: Process different micro-batches on different devices.
from neurenix.hardware.distributed import PipelineParallel

# Define a model
model = neurenix.nn.Sequential(...)

# Create pipeline parallel model
pipeline_model = PipelineParallel(
    model=model,
    num_stages=4,
    devices=["cuda:0", "cuda:1", "cuda:2", "cuda:3"],
    chunks=4  # Number of micro-batches
)

# Training loop
for inputs, targets in dataloader:
    outputs = pipeline_model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

Tutorials

Basic Hardware Acceleration

import neurenix
from neurenix.hardware import get_device, set_device, memory_info
import time

# Check available devices
print("Available devices:")
for device_type in ["cpu", "cuda", "rocm", "webgpu", "vulkan", "opencl"]:
    if neurenix.hardware.is_available(device_type):
        count = neurenix.hardware.device_count(device_type)
        print(f"  {device_type}: {count} device(s)")

# Select the best available device
neurenix.hardware.set_device("auto")
current_device = neurenix.hardware.get_device()
print(f"Selected device: {current_device}")

# Create a simple model
model = neurenix.nn.Sequential(
    neurenix.nn.Linear(784, 256),
    neurenix.nn.ReLU(),
    neurenix.nn.Linear(256, 128),
    neurenix.nn.ReLU(),
    neurenix.nn.Linear(128, 10)
)

# Move model to device
model = model.to(current_device)

# Create random input data
batch_size = 64
input_data = neurenix.randn(batch_size, 784, device=current_device)

# Check memory before inference
total, free, used = neurenix.hardware.memory_info(current_device)
print(f"Memory before inference - Total: {total/1e9:.2f} GB, Free: {free/1e9:.2f} GB, Used: {used/1e9:.2f} GB")

# Warm-up
for _ in range(10):
    _ = model(input_data)

# Benchmark inference
num_iterations = 100
start_time = time.time()
for _ in range(num_iterations):
    _ = model(input_data)

# Synchronize to ensure all operations are complete
if current_device.startswith("cuda") or current_device.startswith("rocm"):
    neurenix.hardware.synchronize()

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Inference time for {num_iterations} iterations: {elapsed_time:.4f} seconds")
print(f"Average time per batch: {elapsed_time / num_iterations * 1000:.2f} ms")

# Check memory after inference
total, free, used = neurenix.hardware.memory_info(current_device)
print(f"Memory after inference - Total: {total/1e9:.2f} GB, Free: {free/1e9:.2f} GB, Used: {used/1e9:.2f} GB")

# Clean up
neurenix.hardware.empty_cache()

Quantization and Optimization

import neurenix
from neurenix.hardware import quantize, optimize
from neurenix.data import DataLoader

# Load a pre-trained model
model = neurenix.models.resnet50(pretrained=True)
original_model = model.clone()  # Keep a copy of the original model

# Prepare a calibration dataset
calibration_dataset = neurenix.data.ImageFolder(
    root="/path/to/calibration/data",
    transform=neurenix.data.transforms.Compose([
        neurenix.data.transforms.Resize(256),
        neurenix.data.transforms.CenterCrop(224),
        neurenix.data.transforms.ToTensor(),
        neurenix.data.transforms.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225]
        )
    ])
)

calibration_loader = DataLoader(
    calibration_dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4
)

# Prepare evaluation dataset
eval_dataset = neurenix.data.ImageFolder(
    root="/path/to/eval/data",
    transform=neurenix.data.transforms.Compose([
        neurenix.data.transforms.Resize(256),
        neurenix.data.transforms.CenterCrop(224),
        neurenix.data.transforms.ToTensor(),
        neurenix.data.transforms.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225]
        )
    ])
)

eval_loader = DataLoader(
    eval_dataset,
    batch_size=32,
    shuffle=False,
    num_workers=4
)

# Function to evaluate model
def evaluate_model(model, dataloader, device):
    model.eval()
    correct = 0
    total = 0
    with neurenix.no_grad():
        for inputs, targets in dataloader:
            inputs = inputs.to(device)
            targets = targets.to(device)
            outputs = model(inputs)
            _, predicted = outputs.max(1)
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item()
    return correct / total * 100

# Set device
device = "cuda:0" if neurenix.hardware.is_available("cuda") else "cpu"
model = model.to(device)
original_model = original_model.to(device)

# Evaluate original model
print("Evaluating original model...")
original_accuracy = evaluate_model(original_model, eval_loader, device)
print(f"Original model accuracy: {original_accuracy:.2f}%")

# Quantize model to INT8
print("Quantizing model to INT8...")
quantized_model = neurenix.hardware.quantize(
    model=model.clone(),
    quantization_scheme="static",
    dtype="int8",
    calibration_dataset=calibration_loader,
    per_channel=True,
    symmetric=False
)

# Evaluate quantized model
print("Evaluating INT8 quantized model...")
quantized_accuracy = evaluate_model(quantized_model, eval_loader, device)
print(f"INT8 quantized model accuracy: {quantized_accuracy:.2f}%")
print(f"Accuracy difference: {original_accuracy - quantized_accuracy:.2f}%")

# Optimize model with operator fusion
print("Optimizing model with operator fusion...")
optimized_model = neurenix.hardware.optimize(
    model=model.clone(),
    target_device=device,
    optimization_level=2,
    fuse_operations=True,
    enable_tensor_cores=True if device.startswith("cuda") else False
)

# Evaluate optimized model
print("Evaluating optimized model...")
optimized_accuracy = evaluate_model(optimized_model, eval_loader, device)
print(f"Optimized model accuracy: {optimized_accuracy:.2f}%")
print(f"Accuracy difference: {original_accuracy - optimized_accuracy:.2f}%")

# Combine quantization and optimization
print("Applying both quantization and optimization...")
quantized_optimized_model = neurenix.hardware.optimize(
    model=quantized_model.clone(),
    target_device=device,
    optimization_level=2,
    fuse_operations=True,
    enable_tensor_cores=True if device.startswith("cuda") else False
)

# Evaluate combined model
print("Evaluating quantized and optimized model...")
combined_accuracy = evaluate_model(quantized_optimized_model, eval_loader, device)
print(f"Quantized and optimized model accuracy: {combined_accuracy:.2f}%")
print(f"Accuracy difference: {original_accuracy - combined_accuracy:.2f}%")

# Compare model sizes
original_size = neurenix.hardware.model_size(original_model)
quantized_size = neurenix.hardware.model_size(quantized_model)
optimized_size = neurenix.hardware.model_size(optimized_model)
combined_size = neurenix.hardware.model_size(quantized_optimized_model)

print(f"Original model size: {original_size/1e6:.2f} MB")
print(f"Quantized model size: {quantized_size/1e6:.2f} MB ({quantized_size/original_size*100:.2f}% of original)")
print(f"Optimized model size: {optimized_size/1e6:.2f} MB ({optimized_size/original_size*100:.2f}% of original)")
print(f"Quantized and optimized model size: {combined_size/1e6:.2f} MB ({combined_size/original_size*100:.2f}% of original)")

# Benchmark inference speed
def benchmark_inference(model, input_tensor, num_iterations=100):
    model.eval()
    # Warm-up
    for _ in range(10):
        _ = model(input_tensor)

    # Synchronize before timing
    if device.startswith("cuda"):
        neurenix.hardware.synchronize()

    start_time = time.time()
    for _ in range(num_iterations):
        _ = model(input_tensor)

    # Synchronize after timing
    if device.startswith("cuda"):
        neurenix.hardware.synchronize()

    end_time = time.time()
    elapsed_time = end_time - start_time
    return elapsed_time / num_iterations * 1000  # ms per inference

# Create a sample input tensor
input_tensor = neurenix.randn(1, 3, 224, 224, device=device)

# Benchmark all models
original_time = benchmark_inference(original_model, input_tensor)
quantized_time = benchmark_inference(quantized_model, input_tensor)
optimized_time = benchmark_inference(optimized_model, input_tensor)
combined_time = benchmark_inference(quantized_optimized_model, input_tensor)

print(f"Original model inference time: {original_time:.2f} ms per image")
print(f"Quantized model inference time: {quantized_time:.2f} ms per image ({original_time/quantized_time:.2f}x speedup)")
print(f"Optimized model inference time: {optimized_time:.2f} ms per image ({original_time/optimized_time:.2f}x speedup)")
print(f"Quantized and optimized model inference time: {combined_time:.2f} ms per image ({original_time/combined_time:.2f}x speedup)")

# Save the optimized model
neurenix.save(quantized_optimized_model, "quantized_optimized_resnet50.nrx")
print("Saved quantized and optimized model to quantized_optimized_resnet50.nrx")

Multi-GPU Distributed Training

import neurenix
import os
import argparse
from neurenix.hardware import distributed
from neurenix.data import DataLoader, DistributedSampler

# Parse arguments
parser = argparse.ArgumentParser(description='Distributed training example')
parser.add_argument('--nodes', default=1, type=int, help='number of nodes')
parser.add_argument('--gpus', default=1, type=int, help='number of gpus per node')
parser.add_argument('--node_rank', default=0, type=int, help='ranking within the nodes')
parser.add_argument('--master_addr', default='localhost', type=str, help='address of master node')
parser.add_argument('--master_port', default='12355', type=str, help='port of master node')
args = parser.parse_args()

# Calculate world_size and rank
args.world_size = args.gpus * args.nodes
os.environ['MASTER_ADDR'] = args.master_addr
os.environ['MASTER_PORT'] = args.master_port

# Define the training function
def train(gpu, args):
    # Calculate rank
    rank = args.node_rank * args.gpus + gpu

    # Initialize distributed process group
    distributed.init_process_group(
        backend='nccl',
        init_method='env://',
        world_size=args.world_size,
        rank=rank
    )

    # Create model
    model = neurenix.models.resnet18()

    # Move model to GPU
    device = f'cuda:{gpu}'
    model = model.to(device)

    # Wrap model with DistributedDataParallel
    model = distributed.DistributedDataParallel(
        model,
        device_ids=[gpu],
        output_device=gpu
    )

    # Create dataset and sampler
    train_dataset = neurenix.data.CIFAR10(
        root='/path/to/data',
        train=True,
        download=True,
        transform=neurenix.data.transforms.Compose([
            neurenix.data.transforms.RandomCrop(32, padding=4),
            neurenix.data.transforms.RandomHorizontalFlip(),
            neurenix.data.transforms.ToTensor(),
            neurenix.data.transforms.Normalize(
                mean=[0.4914, 0.4822, 0.4465],
                std=[0.2470, 0.2435, 0.2616]
            )
        ])
    )

    train_sampler = DistributedSampler(
        dataset=train_dataset,
        num_replicas=args.world_size,
        rank=rank
    )

    train_loader = DataLoader(
        dataset=train_dataset,
        batch_size=128,
        shuffle=False,  # Sampler handles shuffling
        num_workers=4,
        pin_memory=True,
        sampler=train_sampler
    )

    # Define optimizer and loss function
    optimizer = neurenix.optim.SGD(
        model.parameters(),
        lr=0.1,
        momentum=0.9,
        weight_decay=1e-4
    )

    criterion = neurenix.nn.CrossEntropyLoss()

    # Training loop
    for epoch in range(10):
        train_sampler.set_epoch(epoch)
        model.train()

        for i, (images, target) in enumerate(train_loader):
            images = images.to(device)
            target = target.to(device)

            # Forward pass
            output = model(images)
            loss = criterion(output, target)

            # Backward pass and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            if i % 100 == 0 and rank == 0:
                print(f'Epoch: {epoch}, Batch: {i}, Loss: {loss.item()}')

        # Save checkpoint (only on rank 0)
        if rank == 0:
            neurenix.save({
                'epoch': epoch,
                'model_state_dict': model.module.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
            }, f'checkpoint_epoch_{epoch}.nrx')

    # Clean up
    distributed.destroy_process_group()

# Launch training processes
if __name__ == '__main__':
    # Use multiprocessing to launch multiple processes on a single node
    import neurenix.hardware.multiprocessing as mp
    mp.spawn(train, nprocs=args.gpus, args=(args,))

Conclusion

The Hardware Acceleration module in Neurenix provides a comprehensive set of tools for leveraging various hardware accelerators to optimize the performance of machine learning workloads. Its unified device abstraction, support for multiple hardware backends, automatic optimization techniques, and distributed execution capabilities make it a powerful solution for accelerating machine learning applications across different hardware platforms.

Compared to other frameworks like TensorFlow, PyTorch, and Scikit-Learn, Neurenix Hardware offers advantages in terms of hardware support, optimization techniques, and ease of use. Its multi-language architecture ensures optimal performance while maintaining a user-friendly interface, making it suitable for both research and production environments.

By following the best practices and tutorials outlined in this documentation, users can leverage the full power of their hardware to accelerate machine learning workloads, from training large models on high-performance GPUs to deploying optimized models on edge devices.