Hardware Acceleration Module | API - Neurenix Documentation

← Previous Page

Overview¶

The Hardware Acceleration module in Neurenix provides comprehensive support for leveraging various hardware accelerators to optimize the performance of machine learning workloads. This module enables efficient execution of tensor operations, neural network training, and inference across a wide range of hardware platforms, from high-performance GPUs to specialized AI accelerators and edge devices.

Built on Neurenix's multi-language architecture, the Hardware Acceleration module implements high-performance operations in Rust and C++, with a user-friendly Python interface. This design ensures optimal performance while maintaining ease of use for researchers and developers.

The module supports multiple hardware acceleration technologies beyond basic CUDA, ROCm, and WebGPU implementations, including WebAssembly SIMD, Vulkan, OpenCL, DirectML, TensorRT, and specialized hardware like GraphCore IPUs and FPGAs. It also provides automatic device selection, graceful fallbacks, and comprehensive optimization techniques such as quantization and model pruning.

Key Concepts¶

Device Abstraction¶

The Hardware Acceleration module provides a unified device abstraction layer that enables consistent interaction with different hardware accelerators:

Device Management: Unified API for device discovery, selection, and management
Memory Management: Efficient memory allocation, transfer, and synchronization
Stream Management: Concurrent execution of operations on devices
Event Management: Synchronization between operations and devices

This abstraction allows users to write hardware-agnostic code that can run efficiently on various devices without modification.

Hardware Backends¶

The module supports multiple hardware backends, each optimized for specific hardware platforms:

CUDA: NVIDIA GPU acceleration with support for Tensor Cores
ROCm: AMD GPU acceleration
WebGPU: Cross-platform GPU acceleration for web and native applications
WebAssembly: SIMD and WASI-NN for browser and edge device acceleration
Vulkan/OpenCL/oneAPI: Cross-platform GPU acceleration
DirectML: Windows-specific acceleration
oneDNN/MKL-DNN: Optimized deep learning primitives
TensorRT: NVIDIA-specific optimizations
GraphCore IPU: Specialized AI accelerator support
FPGA: Programmable hardware acceleration via OpenCL, Xilinx Vitis, and Intel OpenVINO
NPU: Neural Processing Unit acceleration

Each backend implements the same interface, allowing seamless switching between different hardware platforms.

Automatic Optimization¶

The module provides automatic optimization techniques to maximize performance on different hardware:

Automatic Quantization: INT8, FP16, FP8 quantization for reduced memory usage and faster computation
Model Pruning: Removal of redundant weights and operations
Operator Fusion: Combining multiple operations into optimized kernels
Memory Planning: Efficient allocation and reuse of memory
Kernel Autotuning: Automatic selection of optimal kernel implementations
Mixed Precision Training: Training with lower precision while maintaining accuracy

These optimizations are applied transparently, allowing users to focus on model development rather than hardware-specific optimizations.

Distributed Execution¶

The module supports distributed execution across multiple devices and nodes:

Multi-GPU Training: Efficient utilization of multiple GPUs on a single node
Multi-Node Training: Scaling across multiple compute nodes
Hybrid Execution: Utilizing heterogeneous hardware in a single system
Communication Backends: MPI, Horovod, and DeepSpeed for efficient communication
Synchronization Primitives: Barriers, locks, and events for coordinated execution

This enables efficient scaling of training and inference workloads to large distributed systems.

API Reference¶

Device Management¶

import neurenix
from neurenix.hardware import get_device, set_device, device_count, is_available

# Check device availability
cuda_available = neurenix.hardware.is_available("cuda")
rocm_available = neurenix.hardware.is_available("rocm")
webgpu_available = neurenix.hardware.is_available("webgpu")

# Get device count
num_cuda_devices = neurenix.hardware.device_count("cuda")

# Get current device
current_device = neurenix.hardware.get_device()

# Set device
neurenix.hardware.set_device("cuda:0")

# Device context manager
with neurenix.hardware.device("cuda:1"):
    # Code in this block runs on CUDA device 1
    tensor = neurenix.tensor([1, 2, 3], device="cuda:1")

# Automatic device selection
neurenix.hardware.set_device("auto")  # Selects the best available device

Memory Management¶

from neurenix.hardware import memory_info, empty_cache, set_memory_fraction

# Get memory information
total, free, used = neurenix.hardware.memory_info("cuda:0")
print(f"Total: {total/1e9:.2f} GB, Free: {free/1e9:.2f} GB, Used: {used/1e9:.2f} GB")

# Empty cache
neurenix.hardware.empty_cache()

# Set memory fraction to use
neurenix.hardware.set_memory_fraction(0.8, "cuda:0")  # Use up to 80% of GPU memory

# Pinned memory allocation
pinned_tensor = neurenix.tensor([1, 2, 3], pin_memory=True)

# Unified memory allocation (accessible from both CPU and GPU)
unified_tensor = neurenix.tensor([1, 2, 3], memory="unified")

Stream Management¶

from neurenix.hardware import Stream, Event

# Create a stream
stream = neurenix.hardware.Stream(device="cuda:0")

# Execute operations on a stream
with stream:
    a = neurenix.tensor([1, 2, 3], device="cuda:0")
    b = neurenix.tensor([4, 5, 6], device="cuda:0")
    c = a + b

# Create an event
event = neurenix.hardware.Event(device="cuda:0")

# Record an event
event.record(stream)

# Wait for an event
event.wait()

# Synchronize a stream
stream.synchronize()

# Query if an event has completed
if event.query():
    print("Event completed")

Backend Configuration¶

from neurenix.hardware import set_backend, get_backend, available_backends

# Get available backends
backends = neurenix.hardware.available_backends()
print(f"Available backends: {backends}")

# Set backend
neurenix.hardware.set_backend("cuda")

# Get current backend
current_backend = neurenix.hardware.get_backend()

# Backend context manager
with neurenix.hardware.backend("rocm"):
    # Code in this block uses ROCm backend
    tensor = neurenix.tensor([1, 2, 3], device="rocm:0")

# Backend-specific configuration
neurenix.hardware.config.cuda.set("tensor_cores", True)
neurenix.hardware.config.rocm.set("hip_graph", True)
neurenix.hardware.config.webgpu.set("preferred_backend", "dawn")

Automatic Optimization¶

from neurenix.hardware import optimize, quantize, prune

# Optimize a model for specific hardware
optimized_model = neurenix.hardware.optimize(
    model=my_model,
    target_device="cuda:0",
    optimization_level=3,  # 0-3, higher means more aggressive optimization
    precision="mixed",  # "fp32", "fp16", "int8", "mixed"
    fuse_operations=True,
    enable_tensor_cores=True
)

# Quantize a model
quantized_model = neurenix.hardware.quantize(
    model=my_model,
    quantization_scheme="int8",  # "int8", "fp16", "fp8"
    calibration_dataset=calibration_data,
    per_channel=True,
    symmetric=False
)

# Prune a model
pruned_model = neurenix.hardware.prune(
    model=my_model,
    pruning_method="magnitude",  # "magnitude", "structured", "lottery_ticket"
    sparsity=0.7,  # Target sparsity level
    schedule="gradual"  # "one_shot" or "gradual"
)

Distributed Execution¶

from neurenix.hardware import distributed

# Initialize distributed environment
neurenix.hardware.distributed.init_process_group(
    backend="nccl",  # "nccl", "gloo", "mpi", "horovod"
    init_method="env://",
    world_size=4,
    rank=0
)

# Get distributed information
rank = neurenix.hardware.distributed.get_rank()
world_size = neurenix.hardware.distributed.get_world_size()
local_rank = neurenix.hardware.distributed.get_local_rank()

# Distributed data parallel model
model = neurenix.nn.Sequential(...)
distributed_model = neurenix.hardware.distributed.DistributedDataParallel(
    model,
    device_ids=[local_rank],
    output_device=local_rank,
    broadcast_buffers=True,
    find_unused_parameters=False
)

# All-reduce operation
tensor = neurenix.tensor([1, 2, 3], device=f"cuda:{local_rank}")
neurenix.hardware.distributed.all_reduce(tensor, op="sum")

# Barrier synchronization
neurenix.hardware.distributed.barrier()

# Cleanup
neurenix.hardware.distributed.destroy_process_group()

Hardware-Specific Features¶

NVIDIA Tensor Cores¶

from neurenix.hardware.cuda import enable_tensor_cores, is_tensor_cores_available

# Check if Tensor Cores are available
if neurenix.hardware.cuda.is_tensor_cores_available():
    # Enable Tensor Cores
    neurenix.hardware.cuda.enable_tensor_cores(True)

    # Create a model with Tensor Core optimizations
    model = neurenix.nn.Sequential(...)
    model = neurenix.hardware.cuda.optimize_for_tensor_cores(model)

WebAssembly SIMD¶

from neurenix.hardware.wasm import enable_simd, is_simd_available

# Check if WASM SIMD is available
if neurenix.hardware.wasm.is_simd_available():
    # Enable WASM SIMD
    neurenix.hardware.wasm.enable_simd(True)

    # Create a model optimized for WASM SIMD
    model = neurenix.nn.Sequential(...)
    model = neurenix.hardware.wasm.optimize_for_simd(model)

FPGA Acceleration¶

from neurenix.hardware.fpga import compile_for_fpga, load_bitstream

# Compile a model for FPGA
bitstream = neurenix.hardware.fpga.compile_for_fpga(
    model=my_model,
    target_device="xilinx_u250",
    optimization_level=3,
    precision="int8"
)

# Save bitstream
bitstream.save("/path/to/bitstream.bit")

# Load bitstream
loaded_model = neurenix.hardware.fpga.load_bitstream(
    bitstream_path="/path/to/bitstream.bit",
    device="fpga:0"
)

Framework Comparison¶

Neurenix Hardware vs. TensorFlow Hardware¶

Feature	Neurenix Hardware	TensorFlow Hardware
Device Abstraction	Unified API for all hardware	Separate APIs for different hardware
Hardware Support	Comprehensive support for CUDA, ROCm, WebGPU, WASM, Vulkan, OpenCL, DirectML, FPGA, IPU, NPU	Primarily focused on CUDA, ROCm, and TPU
Edge Device Support	Native support with optimizations	Limited through TensorFlow Lite
Quantization	Automatic and manual quantization with multiple schemes	Limited quantization options
Model Pruning	Built-in pruning with various methods	Requires separate tools
Distributed Training	Integrated with multiple backends	Primarily focused on TensorFlow-specific distribution
Memory Management	Advanced memory management with unified memory	Basic memory management
WebAssembly Support	First-class WASM and WASI-NN support	Limited WASM support

Neurenix Hardware provides more comprehensive hardware support compared to TensorFlow, with better integration of various acceleration technologies and more advanced optimization techniques. Its unified API makes it easier to target different hardware platforms without code changes.

Neurenix Hardware vs. PyTorch Hardware¶

Feature	Neurenix Hardware	PyTorch Hardware
Device Abstraction	Unified API with automatic fallbacks	Device-specific APIs
Hardware Support	Comprehensive support for various hardware	Primarily focused on CUDA with limited support for other hardware
Quantization	Integrated quantization with multiple schemes	Separate quantization toolkit
Model Pruning	Built-in pruning with various methods	Requires separate tools
Distributed Training	Integrated with multiple backends	Primarily focused on PyTorch-specific distribution
Memory Management	Advanced memory management with unified memory	Basic memory management
Edge Device Support	Native support with optimizations	Limited through separate tools
WebGPU Support	First-class WebGPU support	Limited WebGPU support

While PyTorch offers good CUDA support, Neurenix Hardware provides more comprehensive hardware support with better integration of various acceleration technologies. Its unified API and automatic optimizations make it easier to target different hardware platforms efficiently.

Neurenix Hardware vs. Scikit-Learn Hardware¶

Feature	Neurenix Hardware	Scikit-Learn Hardware
Device Abstraction	Unified API for all hardware	Limited hardware abstraction
Hardware Support	Comprehensive support for various hardware	Primarily CPU-focused with limited GPU support
Quantization	Automatic and manual quantization	Limited quantization options
Model Pruning	Built-in pruning with various methods	Limited pruning capabilities
Distributed Training	Integrated with multiple backends	Limited to joblib parallelism
Memory Management	Advanced memory management	Basic memory management
Edge Device Support	Native support with optimizations	Limited edge support
Specialized Hardware	Support for FPGAs, IPUs, NPUs	Limited specialized hardware support

Scikit-Learn is primarily designed for CPU execution with limited GPU support, while Neurenix Hardware provides comprehensive support for various hardware accelerators. This makes Neurenix particularly well-suited for hardware-accelerated machine learning workloads.

Best Practices¶

Efficient Device Utilization¶

Choose the Right Device: Select the appropriate device for your workload.

# For large models with high computational requirements
neurenix.hardware.set_device("cuda:0")

# For deployment on edge devices
neurenix.hardware.set_device("wasm:simd")

# For cross-platform applications
neurenix.hardware.set_device("webgpu:0")

# Let Neurenix choose the best device
neurenix.hardware.set_device("auto")

Minimize Host-Device Transfers: Keep data on the device as much as possible.

# Bad practice: Frequent transfers between CPU and GPU
for i in range(100):
    x = neurenix.tensor([i], device="cpu")
    x = x.to("cuda:0")
    y = x * 2
    result = y.to("cpu")

# Good practice: Keep data on the device
x = neurenix.tensor(list(range(100)), device="cuda:0")
y = x * 2
result = y.to("cpu")  # Transfer only once at the end

Use Asynchronous Operations: Overlap computation and data transfer.

# Create streams for concurrent execution
stream1 = neurenix.hardware.Stream(device="cuda:0")
stream2 = neurenix.hardware.Stream(device="cuda:0")

# Execute operations concurrently
with stream1:
    a = neurenix.tensor([1, 2, 3], device="cuda:0")
    b = a * 2

with stream2:
    c = neurenix.tensor([4, 5, 6], device="cuda:0")
    d = c * 3

# Synchronize when results are needed
stream1.synchronize()
stream2.synchronize()

Batch Processing: Process data in batches to amortize overhead.

# Process data in batches
batch_size = 64
for i in range(0, len(data), batch_size):
    batch = data[i:i+batch_size].to("cuda:0")
    results = model(batch)
    # Process results

Memory Management¶

Monitor Memory Usage: Keep track of memory consumption.

# Check memory usage before and after operations
before_total, before_free, before_used = neurenix.hardware.memory_info("cuda:0")


# Perform operations
result = model(inputs)

after_total, after_free, after_used = neurenix.hardware.memory_info("cuda:0")
memory_used = after_used - before_used
print(f"Operation used {memory_used/1e6:.2f} MB")

Use Memory-Efficient Techniques: Implement techniques to reduce memory usage.

# Gradient checkpointing
model = neurenix.nn.Sequential(...)
model = neurenix.hardware.enable_checkpointing(model)

# Mixed precision training
with neurenix.hardware.autocast(dtype="float16"):
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)

Release Memory When Not Needed: Explicitly free memory when possible.

# Delete tensors when not needed
del tensor
neurenix.hardware.empty_cache()

# Use context managers for temporary allocations
with neurenix.hardware.temp_tensors():
    # Tensors created here will be automatically freed
    temp = neurenix.zeros((10000, 10000), device="cuda:0")
    # Use temp
# temp is automatically freed here

Optimization Techniques¶

Quantization: Reduce precision to improve performance.

# Dynamic quantization
quantized_model = neurenix.hardware.quantize(
    model=my_model,
    quantization_scheme="dynamic",
    dtype="int8"
)

# Static quantization with calibration
quantized_model = neurenix.hardware.quantize(
    model=my_model,
    quantization_scheme="static",
    dtype="int8",
    calibration_dataset=calibration_data
)

# Quantization-aware training
with neurenix.hardware.quantization_aware_training():
    for epoch in range(epochs):
        for inputs, targets in dataloader:
            outputs = model(inputs)
            loss = loss_fn(outputs, targets)
            loss.backward()
            optimizer.step()

Model Pruning: Remove redundant weights to reduce model size.

# Magnitude pruning
pruned_model = neurenix.hardware.prune(
    model=my_model,
    pruning_method="magnitude",
    sparsity=0.7
)

# Structured pruning
pruned_model = neurenix.hardware.prune(
    model=my_model,
    pruning_method="structured",
    sparsity=0.5,
    structure_type="channel"
)

# Gradual pruning during training
pruner = neurenix.hardware.GradualPruner(
    model=my_model,
    pruning_method="magnitude",
    initial_sparsity=0.0,
    final_sparsity=0.8,
    start_epoch=5,
    end_epoch=25
)

for epoch in range(epochs):
    pruner.step(epoch)
    for inputs, targets in dataloader:
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)
        loss.backward()
        optimizer.step()

Operator Fusion: Combine multiple operations for better performance.

# Enable operator fusion
neurenix.hardware.config.set("operator_fusion", True)

# Create a model with fused operations
model = neurenix.nn.Sequential(...)
optimized_model = neurenix.hardware.optimize(
    model=model,
    fuse_operations=True
)

Multi-Device and Distributed Training¶

Data Parallel Training: Distribute training across multiple devices.

# Initialize distributed environment
neurenix.hardware.distributed.init_process_group(
    backend="nccl",
    init_method="env://",
    world_size=world_size,
    rank=rank
)

# Create model and move to device
model = MyModel()
model = model.to(f"cuda:{local_rank}")

# Wrap model with DistributedDataParallel
model = neurenix.hardware.distributed.DistributedDataParallel(
    model,
    device_ids=[local_rank],
    output_device=local_rank
)

# Create distributed sampler
train_sampler = neurenix.data.distributed.DistributedSampler(
    dataset=train_dataset,
    num_replicas=world_size,
    rank=rank
)

# Create data loader with distributed sampler
train_loader = neurenix.data.DataLoader(
    dataset=train_dataset,
    batch_size=batch_size,
    sampler=train_sampler,
    num_workers=4,
    pin_memory=True
)

# Training loop
for epoch in range(epochs):
    train_sampler.set_epoch(epoch)
    for inputs, targets in train_loader:
        inputs = inputs.to(f"cuda:{local_rank}")
        targets = targets.to(f"cuda:{local_rank}")

        outputs = model(inputs)
        loss = loss_fn(outputs, targets)
        loss.backward()
        optimizer.step()

Model Parallel Training: Split a model across multiple devices.

# Define a model with layers on different devices
class ModelParallelModel(neurenix.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = neurenix.nn.Linear(1024, 512).to("cuda:0")
        self.layer2 = neurenix.nn.Linear(512, 256).to("cuda:1")
        self.layer3 = neurenix.nn.Linear(256, 10).to("cuda:2")

    def forward(self, x):
        x = x.to("cuda:0")
        x = self.layer1(x)
        x = x.to("cuda:1")
        x = self.layer2(x)
        x = x.to("cuda:2")
        x = self.layer3(x)
        return x

# Create model
model = ModelParallelModel()

# Training loop
for inputs, targets in dataloader:
    outputs = model(inputs)
    targets = targets.to("cuda:2")  # Move targets to the device of the output
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

Pipeline Parallel Training: Process different micro-batches on different devices.

from neurenix.hardware.distributed import PipelineParallel

# Define a model
model = neurenix.nn.Sequential(...)

# Create pipeline parallel model
pipeline_model = PipelineParallel(
    model=model,
    num_stages=4,
    devices=["cuda:0", "cuda:1", "cuda:2", "cuda:3"],
    chunks=4  # Number of micro-batches
)

# Training loop
for inputs, targets in dataloader:
    outputs = pipeline_model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

Tutorials¶

Basic Hardware Acceleration¶

import neurenix
from neurenix.hardware import get_device, set_device, memory_info
import time

# Check available devices
print("Available devices:")
for device_type in ["cpu", "cuda", "rocm", "webgpu", "vulkan", "opencl"]:
    if neurenix.hardware.is_available(device_type):
        count = neurenix.hardware.device_count(device_type)
        print(f"  {device_type}: {count} device(s)")

# Select the best available device
neurenix.hardware.set_device("auto")
current_device = neurenix.hardware.get_device()
print(f"Selected device: {current_device}")

# Create a simple model
model = neurenix.nn.Sequential(
    neurenix.nn.Linear(784, 256),
    neurenix.nn.ReLU(),
    neurenix.nn.Linear(256, 128),
    neurenix.nn.ReLU(),
    neurenix.nn.Linear(128, 10)
)

# Move model to device
model = model.to(current_device)

# Create random input data
batch_size = 64
input_data = neurenix.randn(batch_size, 784, device=current_device)

# Check memory before inference
total, free, used = neurenix.hardware.memory_info(current_device)
print(f"Memory before inference - Total: {total/1e9:.2f} GB, Free: {free/1e9:.2f} GB, Used: {used/1e9:.2f} GB")

# Warm-up
for _ in range(10):
    _ = model(input_data)

# Benchmark inference
num_iterations = 100
start_time = time.time()
for _ in range(num_iterations):
    _ = model(input_data)

# Synchronize to ensure all operations are complete
if current_device.startswith("cuda") or current_device.startswith("rocm"):
    neurenix.hardware.synchronize()

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Inference time for {num_iterations} iterations: {elapsed_time:.4f} seconds")
print(f"Average time per batch: {elapsed_time / num_iterations * 1000:.2f} ms")

# Check memory after inference
total, free, used = neurenix.hardware.memory_info(current_device)
print(f"Memory after inference - Total: {total/1e9:.2f} GB, Free: {free/1e9:.2f} GB, Used: {used/1e9:.2f} GB")

# Clean up
neurenix.hardware.empty_cache()

Quantization and Optimization¶

import neurenix
from neurenix.hardware import quantize, optimize
from neurenix.data import DataLoader

# Load a pre-trained model
model = neurenix.models.resnet50(pretrained=True)
original_model = model.clone()  # Keep a copy of the original model

# Prepare a calibration dataset
calibration_dataset = neurenix.data.ImageFolder(
    root="/path/to/calibration/data",
    transform=neurenix.data.transforms.Compose([
        neurenix.data.transforms.Resize(256),
        neurenix.data.transforms.CenterCrop(224),
        neurenix.data.transforms.ToTensor(),
        neurenix.data.transforms.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225]
        )
    ])
)

calibration_loader = DataLoader(
    calibration_dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4
)

# Prepare evaluation dataset
eval_dataset = neurenix.data.ImageFolder(
    root="/path/to/eval/data",
    transform=neurenix.data.transforms.Compose([
        neurenix.data.transforms.Resize(256),
        neurenix.data.transforms.CenterCrop(224),
        neurenix.data.transforms.ToTensor(),
        neurenix.data.transforms.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225]
        )
    ])
)

eval_loader = DataLoader(
    eval_dataset,
    batch_size=32,
    shuffle=False,
    num_workers=4
)

# Function to evaluate model
def evaluate_model(model, dataloader, device):
    model.eval()
    correct = 0
    total = 0
    with neurenix.no_grad():
        for inputs, targets in dataloader:
            inputs = inputs.to(device)
            targets = targets.to(device)
            outputs = model(inputs)
            _, predicted = outputs.max(1)
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item()
    return correct / total * 100

# Set device
device = "cuda:0" if neurenix.hardware.is_available("cuda") else "cpu"
model = model.to(device)
original_model = original_model.to(device)

# Evaluate original model
print("Evaluating original model...")
original_accuracy = evaluate_model(original_model, eval_loader, device)
print(f"Original model accuracy: {original_accuracy:.2f}%")

# Quantize model to INT8
print("Quantizing model to INT8...")
quantized_model = neurenix.hardware.quantize(
    model=model.clone(),
    quantization_scheme="static",
    dtype="int8",
    calibration_dataset=calibration_loader,
    per_channel=True,
    symmetric=False
)

# Evaluate quantized model
print("Evaluating INT8 quantized model...")
quantized_accuracy = evaluate_model(quantized_model, eval_loader, device)
print(f"INT8 quantized model accuracy: {quantized_accuracy:.2f}%")
print(f"Accuracy difference: {original_accuracy - quantized_accuracy:.2f}%")

# Optimize model with operator fusion
print("Optimizing model with operator fusion...")
optimized_model = neurenix.hardware.optimize(
    model=model.clone(),
    target_device=device,
    optimization_level=2,
    fuse_operations=True,
    enable_tensor_cores=True if device.startswith("cuda") else False
)

# Evaluate optimized model
print("Evaluating optimized model...")
optimized_accuracy = evaluate_model(optimized_model, eval_loader, device)
print(f"Optimized model accuracy: {optimized_accuracy:.2f}%")
print(f"Accuracy difference: {original_accuracy - optimized_accuracy:.2f}%")

# Combine quantization and optimization
print("Applying both quantization and optimization...")
quantized_optimized_model = neurenix.hardware.optimize(
    model=quantized_model.clone(),
    target_device=device,
    optimization_level=2,
    fuse_operations=True,
    enable_tensor_cores=True if device.startswith("cuda") else False
)

# Evaluate combined model
print("Evaluating quantized and optimized model...")
combined_accuracy = evaluate_model(quantized_optimized_model, eval_loader, device)
print(f"Quantized and optimized model accuracy: {combined_accuracy:.2f}%")
print(f"Accuracy difference: {original_accuracy - combined_accuracy:.2f}%")

# Compare model sizes
original_size = neurenix.hardware.model_size(original_model)
quantized_size = neurenix.hardware.model_size(quantized_model)
optimized_size = neurenix.hardware.model_size(optimized_model)
combined_size = neurenix.hardware.model_size(quantized_optimized_model)

print(f"Original model size: {original_size/1e6:.2f} MB")
print(f"Quantized model size: {quantized_size/1e6:.2f} MB ({quantized_size/original_size*100:.2f}% of original)")
print(f"Optimized model size: {optimized_size/1e6:.2f} MB ({optimized_size/original_size*100:.2f}% of original)")
print(f"Quantized and optimized model size: {combined_size/1e6:.2f} MB ({combined_size/original_size*100:.2f}% of original)")

# Benchmark inference speed
def benchmark_inference(model, input_tensor, num_iterations=100):
    model.eval()
    # Warm-up
    for _ in range(10):
        _ = model(input_tensor)

    # Synchronize before timing
    if device.startswith("cuda"):
        neurenix.hardware.synchronize()

    start_time = time.time()
    for _ in range(num_iterations):
        _ = model(input_tensor)

    # Synchronize after timing
    if device.startswith("cuda"):
        neurenix.hardware.synchronize()

    end_time = time.time()
    elapsed_time = end_time - start_time
    return elapsed_time / num_iterations * 1000  # ms per inference

# Create a sample input tensor
input_tensor = neurenix.randn(1, 3, 224, 224, device=device)

# Benchmark all models
original_time = benchmark_inference(original_model, input_tensor)
quantized_time = benchmark_inference(quantized_model, input_tensor)
optimized_time = benchmark_inference(optimized_model, input_tensor)
combined_time = benchmark_inference(quantized_optimized_model, input_tensor)

print(f"Original model inference time: {original_time:.2f} ms per image")
print(f"Quantized model inference time: {quantized_time:.2f} ms per image ({original_time/quantized_time:.2f}x speedup)")
print(f"Optimized model inference time: {optimized_time:.2f} ms per image ({original_time/optimized_time:.2f}x speedup)")
print(f"Quantized and optimized model inference time: {combined_time:.2f} ms per image ({original_time/combined_time:.2f}x speedup)")

# Save the optimized model
neurenix.save(quantized_optimized_model, "quantized_optimized_resnet50.nrx")
print("Saved quantized and optimized model to quantized_optimized_resnet50.nrx")

Multi-GPU Distributed Training¶

import neurenix
import os
import argparse
from neurenix.hardware import distributed
from neurenix.data import DataLoader, DistributedSampler

# Parse arguments
parser = argparse.ArgumentParser(description='Distributed training example')
parser.add_argument('--nodes', default=1, type=int, help='number of nodes')
parser.add_argument('--gpus', default=1, type=int, help='number of gpus per node')
parser.add_argument('--node_rank', default=0, type=int, help='ranking within the nodes')
parser.add_argument('--master_addr', default='localhost', type=str, help='address of master node')
parser.add_argument('--master_port', default='12355', type=str, help='port of master node')
args = parser.parse_args()

# Calculate world_size and rank
args.world_size = args.gpus * args.nodes
os.environ['MASTER_ADDR'] = args.master_addr
os.environ['MASTER_PORT'] = args.master_port

# Define the training function
def train(gpu, args):
    # Calculate rank
    rank = args.node_rank * args.gpus + gpu

    # Initialize distributed process group
    distributed.init_process_group(
        backend='nccl',
        init_method='env://',
        world_size=args.world_size,
        rank=rank
    )

    # Create model
    model = neurenix.models.resnet18()

    # Move model to GPU
    device = f'cuda:{gpu}'
    model = model.to(device)

    # Wrap model with DistributedDataParallel
    model = distributed.DistributedDataParallel(
        model,
        device_ids=[gpu],
        output_device=gpu
    )

    # Create dataset and sampler
    train_dataset = neurenix.data.CIFAR10(
        root='/path/to/data',
        train=True,
        download=True,
        transform=neurenix.data.transforms.Compose([
            neurenix.data.transforms.RandomCrop(32, padding=4),
            neurenix.data.transforms.RandomHorizontalFlip(),
            neurenix.data.transforms.ToTensor(),
            neurenix.data.transforms.Normalize(
                mean=[0.4914, 0.4822, 0.4465],
                std=[0.2470, 0.2435, 0.2616]
            )
        ])
    )

    train_sampler = DistributedSampler(
        dataset=train_dataset,
        num_replicas=args.world_size,
        rank=rank
    )

    train_loader = DataLoader(
        dataset=train_dataset,
        batch_size=128,
        shuffle=False,  # Sampler handles shuffling
        num_workers=4,
        pin_memory=True,
        sampler=train_sampler
    )

    # Define optimizer and loss function
    optimizer = neurenix.optim.SGD(
        model.parameters(),
        lr=0.1,
        momentum=0.9,
        weight_decay=1e-4
    )

    criterion = neurenix.nn.CrossEntropyLoss()

    # Training loop
    for epoch in range(10):
        train_sampler.set_epoch(epoch)
        model.train()

        for i, (images, target) in enumerate(train_loader):
            images = images.to(device)
            target = target.to(device)

            # Forward pass
            output = model(images)
            loss = criterion(output, target)

            # Backward pass and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            if i % 100 == 0 and rank == 0:
                print(f'Epoch: {epoch}, Batch: {i}, Loss: {loss.item()}')

        # Save checkpoint (only on rank 0)
        if rank == 0:
            neurenix.save({
                'epoch': epoch,
                'model_state_dict': model.module.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
            }, f'checkpoint_epoch_{epoch}.nrx')

    # Clean up
    distributed.destroy_process_group()

# Launch training processes
if __name__ == '__main__':
    # Use multiprocessing to launch multiple processes on a single node
    import neurenix.hardware.multiprocessing as mp
    mp.spawn(train, nprocs=args.gpus, args=(args,))

Conclusion¶

The Hardware Acceleration module in Neurenix provides a comprehensive set of tools for leveraging various hardware accelerators to optimize the performance of machine learning workloads. Its unified device abstraction, support for multiple hardware backends, automatic optimization techniques, and distributed execution capabilities make it a powerful solution for accelerating machine learning applications across different hardware platforms.

Compared to other frameworks like TensorFlow, PyTorch, and Scikit-Learn, Neurenix Hardware offers advantages in terms of hardware support, optimization techniques, and ease of use. Its multi-language architecture ensures optimal performance while maintaining a user-friendly interface, making it suitable for both research and production environments.

By following the best practices and tutorials outlined in this documentation, users can leverage the full power of their hardware to accelerate machine learning workloads, from training large models on high-performance GPUs to deploying optimized models on edge devices.

Page

Table Of Contents

Hardware Acceleration Module¶

Overview¶

Key Concepts¶

Device Abstraction¶

Hardware Backends¶

Automatic Optimization¶

Distributed Execution¶

API Reference¶

Device Management¶

Memory Management¶

Stream Management¶

Backend Configuration¶

Automatic Optimization¶

Distributed Execution¶

Hardware-Specific Features¶

NVIDIA Tensor Cores¶

WebAssembly SIMD¶

FPGA Acceleration¶

Framework Comparison¶

Neurenix Hardware vs. TensorFlow Hardware¶

Neurenix Hardware vs. PyTorch Hardware¶

Neurenix Hardware vs. Scikit-Learn Hardware¶

Best Practices¶

Efficient Device Utilization¶

Memory Management¶

Optimization Techniques¶

Multi-Device and Distributed Training¶

Tutorials¶

Basic Hardware Acceleration¶

Quantization and Optimization¶

Multi-GPU Distributed Training¶

Conclusion¶