The Hardware Acceleration module in Neurenix provides comprehensive support for leveraging various hardware accelerators to optimize the performance of machine learning workloads. This module enables efficient execution of tensor operations, neural network training, and inference across a wide range of hardware platforms, from high-performance GPUs to specialized AI accelerators and edge devices.
Built on Neurenix's multi-language architecture, the Hardware Acceleration module implements high-performance operations in Rust and C++, with a user-friendly Python interface. This design ensures optimal performance while maintaining ease of use for researchers and developers.
The module supports multiple hardware acceleration technologies beyond basic CUDA, ROCm, and WebGPU implementations, including WebAssembly SIMD, Vulkan, OpenCL, DirectML, TensorRT, and specialized hardware like GraphCore IPUs and FPGAs. It also provides automatic device selection, graceful fallbacks, and comprehensive optimization techniques such as quantization and model pruning.
The Hardware Acceleration module provides a unified device abstraction layer that enables consistent interaction with different hardware accelerators:
This abstraction allows users to write hardware-agnostic code that can run efficiently on various devices without modification.
The module supports multiple hardware backends, each optimized for specific hardware platforms:
Each backend implements the same interface, allowing seamless switching between different hardware platforms.
The module provides automatic optimization techniques to maximize performance on different hardware:
These optimizations are applied transparently, allowing users to focus on model development rather than hardware-specific optimizations.
The module supports distributed execution across multiple devices and nodes:
This enables efficient scaling of training and inference workloads to large distributed systems.
import neurenix
from neurenix.hardware import get_device, set_device, device_count, is_available
# Check device availability
cuda_available = neurenix.hardware.is_available("cuda")
rocm_available = neurenix.hardware.is_available("rocm")
webgpu_available = neurenix.hardware.is_available("webgpu")
# Get device count
num_cuda_devices = neurenix.hardware.device_count("cuda")
# Get current device
current_device = neurenix.hardware.get_device()
# Set device
neurenix.hardware.set_device("cuda:0")
# Device context manager
with neurenix.hardware.device("cuda:1"):
# Code in this block runs on CUDA device 1
tensor = neurenix.tensor([1, 2, 3], device="cuda:1")
# Automatic device selection
neurenix.hardware.set_device("auto") # Selects the best available device
from neurenix.hardware import memory_info, empty_cache, set_memory_fraction
# Get memory information
total, free, used = neurenix.hardware.memory_info("cuda:0")
print(f"Total: {total/1e9:.2f} GB, Free: {free/1e9:.2f} GB, Used: {used/1e9:.2f} GB")
# Empty cache
neurenix.hardware.empty_cache()
# Set memory fraction to use
neurenix.hardware.set_memory_fraction(0.8, "cuda:0") # Use up to 80% of GPU memory
# Pinned memory allocation
pinned_tensor = neurenix.tensor([1, 2, 3], pin_memory=True)
# Unified memory allocation (accessible from both CPU and GPU)
unified_tensor = neurenix.tensor([1, 2, 3], memory="unified")
from neurenix.hardware import Stream, Event
# Create a stream
stream = neurenix.hardware.Stream(device="cuda:0")
# Execute operations on a stream
with stream:
a = neurenix.tensor([1, 2, 3], device="cuda:0")
b = neurenix.tensor([4, 5, 6], device="cuda:0")
c = a + b
# Create an event
event = neurenix.hardware.Event(device="cuda:0")
# Record an event
event.record(stream)
# Wait for an event
event.wait()
# Synchronize a stream
stream.synchronize()
# Query if an event has completed
if event.query():
print("Event completed")
from neurenix.hardware import set_backend, get_backend, available_backends
# Get available backends
backends = neurenix.hardware.available_backends()
print(f"Available backends: {backends}")
# Set backend
neurenix.hardware.set_backend("cuda")
# Get current backend
current_backend = neurenix.hardware.get_backend()
# Backend context manager
with neurenix.hardware.backend("rocm"):
# Code in this block uses ROCm backend
tensor = neurenix.tensor([1, 2, 3], device="rocm:0")
# Backend-specific configuration
neurenix.hardware.config.cuda.set("tensor_cores", True)
neurenix.hardware.config.rocm.set("hip_graph", True)
neurenix.hardware.config.webgpu.set("preferred_backend", "dawn")
from neurenix.hardware import optimize, quantize, prune
# Optimize a model for specific hardware
optimized_model = neurenix.hardware.optimize(
model=my_model,
target_device="cuda:0",
optimization_level=3, # 0-3, higher means more aggressive optimization
precision="mixed", # "fp32", "fp16", "int8", "mixed"
fuse_operations=True,
enable_tensor_cores=True
)
# Quantize a model
quantized_model = neurenix.hardware.quantize(
model=my_model,
quantization_scheme="int8", # "int8", "fp16", "fp8"
calibration_dataset=calibration_data,
per_channel=True,
symmetric=False
)
# Prune a model
pruned_model = neurenix.hardware.prune(
model=my_model,
pruning_method="magnitude", # "magnitude", "structured", "lottery_ticket"
sparsity=0.7, # Target sparsity level
schedule="gradual" # "one_shot" or "gradual"
)
from neurenix.hardware import distributed
# Initialize distributed environment
neurenix.hardware.distributed.init_process_group(
backend="nccl", # "nccl", "gloo", "mpi", "horovod"
init_method="env://",
world_size=4,
rank=0
)
# Get distributed information
rank = neurenix.hardware.distributed.get_rank()
world_size = neurenix.hardware.distributed.get_world_size()
local_rank = neurenix.hardware.distributed.get_local_rank()
# Distributed data parallel model
model = neurenix.nn.Sequential(...)
distributed_model = neurenix.hardware.distributed.DistributedDataParallel(
model,
device_ids=[local_rank],
output_device=local_rank,
broadcast_buffers=True,
find_unused_parameters=False
)
# All-reduce operation
tensor = neurenix.tensor([1, 2, 3], device=f"cuda:{local_rank}")
neurenix.hardware.distributed.all_reduce(tensor, op="sum")
# Barrier synchronization
neurenix.hardware.distributed.barrier()
# Cleanup
neurenix.hardware.distributed.destroy_process_group()
from neurenix.hardware.cuda import enable_tensor_cores, is_tensor_cores_available
# Check if Tensor Cores are available
if neurenix.hardware.cuda.is_tensor_cores_available():
# Enable Tensor Cores
neurenix.hardware.cuda.enable_tensor_cores(True)
# Create a model with Tensor Core optimizations
model = neurenix.nn.Sequential(...)
model = neurenix.hardware.cuda.optimize_for_tensor_cores(model)
from neurenix.hardware.wasm import enable_simd, is_simd_available
# Check if WASM SIMD is available
if neurenix.hardware.wasm.is_simd_available():
# Enable WASM SIMD
neurenix.hardware.wasm.enable_simd(True)
# Create a model optimized for WASM SIMD
model = neurenix.nn.Sequential(...)
model = neurenix.hardware.wasm.optimize_for_simd(model)
from neurenix.hardware.fpga import compile_for_fpga, load_bitstream
# Compile a model for FPGA
bitstream = neurenix.hardware.fpga.compile_for_fpga(
model=my_model,
target_device="xilinx_u250",
optimization_level=3,
precision="int8"
)
# Save bitstream
bitstream.save("/path/to/bitstream.bit")
# Load bitstream
loaded_model = neurenix.hardware.fpga.load_bitstream(
bitstream_path="/path/to/bitstream.bit",
device="fpga:0"
)
| Feature | Neurenix Hardware | TensorFlow Hardware |
|---|---|---|
| Device Abstraction | Unified API for all hardware | Separate APIs for different hardware |
| Hardware Support | Comprehensive support for CUDA, ROCm, WebGPU, WASM, Vulkan, OpenCL, DirectML, FPGA, IPU, NPU | Primarily focused on CUDA, ROCm, and TPU |
| Edge Device Support | Native support with optimizations | Limited through TensorFlow Lite |
| Quantization | Automatic and manual quantization with multiple schemes | Limited quantization options |
| Model Pruning | Built-in pruning with various methods | Requires separate tools |
| Distributed Training | Integrated with multiple backends | Primarily focused on TensorFlow-specific distribution |
| Memory Management | Advanced memory management with unified memory | Basic memory management |
| WebAssembly Support | First-class WASM and WASI-NN support | Limited WASM support |
Neurenix Hardware provides more comprehensive hardware support compared to TensorFlow, with better integration of various acceleration technologies and more advanced optimization techniques. Its unified API makes it easier to target different hardware platforms without code changes.
| Feature | Neurenix Hardware | PyTorch Hardware |
|---|---|---|
| Device Abstraction | Unified API with automatic fallbacks | Device-specific APIs |
| Hardware Support | Comprehensive support for various hardware | Primarily focused on CUDA with limited support for other hardware |
| Quantization | Integrated quantization with multiple schemes | Separate quantization toolkit |
| Model Pruning | Built-in pruning with various methods | Requires separate tools |
| Distributed Training | Integrated with multiple backends | Primarily focused on PyTorch-specific distribution |
| Memory Management | Advanced memory management with unified memory | Basic memory management |
| Edge Device Support | Native support with optimizations | Limited through separate tools |
| WebGPU Support | First-class WebGPU support | Limited WebGPU support |
While PyTorch offers good CUDA support, Neurenix Hardware provides more comprehensive hardware support with better integration of various acceleration technologies. Its unified API and automatic optimizations make it easier to target different hardware platforms efficiently.
| Feature | Neurenix Hardware | Scikit-Learn Hardware |
|---|---|---|
| Device Abstraction | Unified API for all hardware | Limited hardware abstraction |
| Hardware Support | Comprehensive support for various hardware | Primarily CPU-focused with limited GPU support |
| Quantization | Automatic and manual quantization | Limited quantization options |
| Model Pruning | Built-in pruning with various methods | Limited pruning capabilities |
| Distributed Training | Integrated with multiple backends | Limited to joblib parallelism |
| Memory Management | Advanced memory management | Basic memory management |
| Edge Device Support | Native support with optimizations | Limited edge support |
| Specialized Hardware | Support for FPGAs, IPUs, NPUs | Limited specialized hardware support |
Scikit-Learn is primarily designed for CPU execution with limited GPU support, while Neurenix Hardware provides comprehensive support for various hardware accelerators. This makes Neurenix particularly well-suited for hardware-accelerated machine learning workloads.
# For large models with high computational requirements
neurenix.hardware.set_device("cuda:0")
# For deployment on edge devices
neurenix.hardware.set_device("wasm:simd")
# For cross-platform applications
neurenix.hardware.set_device("webgpu:0")
# Let Neurenix choose the best device
neurenix.hardware.set_device("auto")
# Bad practice: Frequent transfers between CPU and GPU
for i in range(100):
x = neurenix.tensor([i], device="cpu")
x = x.to("cuda:0")
y = x * 2
result = y.to("cpu")
# Good practice: Keep data on the device
x = neurenix.tensor(list(range(100)), device="cuda:0")
y = x * 2
result = y.to("cpu") # Transfer only once at the end
# Create streams for concurrent execution
stream1 = neurenix.hardware.Stream(device="cuda:0")
stream2 = neurenix.hardware.Stream(device="cuda:0")
# Execute operations concurrently
with stream1:
a = neurenix.tensor([1, 2, 3], device="cuda:0")
b = a * 2
with stream2:
c = neurenix.tensor([4, 5, 6], device="cuda:0")
d = c * 3
# Synchronize when results are needed
stream1.synchronize()
stream2.synchronize()
# Process data in batches
batch_size = 64
for i in range(0, len(data), batch_size):
batch = data[i:i+batch_size].to("cuda:0")
results = model(batch)
# Process results
# Check memory usage before and after operations
before_total, before_free, before_used = neurenix.hardware.memory_info("cuda:0")
# Perform operations
result = model(inputs)
after_total, after_free, after_used = neurenix.hardware.memory_info("cuda:0")
memory_used = after_used - before_used
print(f"Operation used {memory_used/1e6:.2f} MB")
# Gradient checkpointing
model = neurenix.nn.Sequential(...)
model = neurenix.hardware.enable_checkpointing(model)
# Mixed precision training
with neurenix.hardware.autocast(dtype="float16"):
outputs = model(inputs)
loss = loss_fn(outputs, targets)
# Delete tensors when not needed
del tensor
neurenix.hardware.empty_cache()
# Use context managers for temporary allocations
with neurenix.hardware.temp_tensors():
# Tensors created here will be automatically freed
temp = neurenix.zeros((10000, 10000), device="cuda:0")
# Use temp
# temp is automatically freed here
# Dynamic quantization
quantized_model = neurenix.hardware.quantize(
model=my_model,
quantization_scheme="dynamic",
dtype="int8"
)
# Static quantization with calibration
quantized_model = neurenix.hardware.quantize(
model=my_model,
quantization_scheme="static",
dtype="int8",
calibration_dataset=calibration_data
)
# Quantization-aware training
with neurenix.hardware.quantization_aware_training():
for epoch in range(epochs):
for inputs, targets in dataloader:
outputs = model(inputs)
loss = loss_fn(outputs, targets)
loss.backward()
optimizer.step()
# Magnitude pruning
pruned_model = neurenix.hardware.prune(
model=my_model,
pruning_method="magnitude",
sparsity=0.7
)
# Structured pruning
pruned_model = neurenix.hardware.prune(
model=my_model,
pruning_method="structured",
sparsity=0.5,
structure_type="channel"
)
# Gradual pruning during training
pruner = neurenix.hardware.GradualPruner(
model=my_model,
pruning_method="magnitude",
initial_sparsity=0.0,
final_sparsity=0.8,
start_epoch=5,
end_epoch=25
)
for epoch in range(epochs):
pruner.step(epoch)
for inputs, targets in dataloader:
outputs = model(inputs)
loss = loss_fn(outputs, targets)
loss.backward()
optimizer.step()
# Enable operator fusion
neurenix.hardware.config.set("operator_fusion", True)
# Create a model with fused operations
model = neurenix.nn.Sequential(...)
optimized_model = neurenix.hardware.optimize(
model=model,
fuse_operations=True
)
# Initialize distributed environment
neurenix.hardware.distributed.init_process_group(
backend="nccl",
init_method="env://",
world_size=world_size,
rank=rank
)
# Create model and move to device
model = MyModel()
model = model.to(f"cuda:{local_rank}")
# Wrap model with DistributedDataParallel
model = neurenix.hardware.distributed.DistributedDataParallel(
model,
device_ids=[local_rank],
output_device=local_rank
)
# Create distributed sampler
train_sampler = neurenix.data.distributed.DistributedSampler(
dataset=train_dataset,
num_replicas=world_size,
rank=rank
)
# Create data loader with distributed sampler
train_loader = neurenix.data.DataLoader(
dataset=train_dataset,
batch_size=batch_size,
sampler=train_sampler,
num_workers=4,
pin_memory=True
)
# Training loop
for epoch in range(epochs):
train_sampler.set_epoch(epoch)
for inputs, targets in train_loader:
inputs = inputs.to(f"cuda:{local_rank}")
targets = targets.to(f"cuda:{local_rank}")
outputs = model(inputs)
loss = loss_fn(outputs, targets)
loss.backward()
optimizer.step()
# Define a model with layers on different devices
class ModelParallelModel(neurenix.nn.Module):
def __init__(self):
super().__init__()
self.layer1 = neurenix.nn.Linear(1024, 512).to("cuda:0")
self.layer2 = neurenix.nn.Linear(512, 256).to("cuda:1")
self.layer3 = neurenix.nn.Linear(256, 10).to("cuda:2")
def forward(self, x):
x = x.to("cuda:0")
x = self.layer1(x)
x = x.to("cuda:1")
x = self.layer2(x)
x = x.to("cuda:2")
x = self.layer3(x)
return x
# Create model
model = ModelParallelModel()
# Training loop
for inputs, targets in dataloader:
outputs = model(inputs)
targets = targets.to("cuda:2") # Move targets to the device of the output
loss = loss_fn(outputs, targets)
loss.backward()
optimizer.step()
from neurenix.hardware.distributed import PipelineParallel
# Define a model
model = neurenix.nn.Sequential(...)
# Create pipeline parallel model
pipeline_model = PipelineParallel(
model=model,
num_stages=4,
devices=["cuda:0", "cuda:1", "cuda:2", "cuda:3"],
chunks=4 # Number of micro-batches
)
# Training loop
for inputs, targets in dataloader:
outputs = pipeline_model(inputs)
loss = loss_fn(outputs, targets)
loss.backward()
optimizer.step()
import neurenix
from neurenix.hardware import get_device, set_device, memory_info
import time
# Check available devices
print("Available devices:")
for device_type in ["cpu", "cuda", "rocm", "webgpu", "vulkan", "opencl"]:
if neurenix.hardware.is_available(device_type):
count = neurenix.hardware.device_count(device_type)
print(f" {device_type}: {count} device(s)")
# Select the best available device
neurenix.hardware.set_device("auto")
current_device = neurenix.hardware.get_device()
print(f"Selected device: {current_device}")
# Create a simple model
model = neurenix.nn.Sequential(
neurenix.nn.Linear(784, 256),
neurenix.nn.ReLU(),
neurenix.nn.Linear(256, 128),
neurenix.nn.ReLU(),
neurenix.nn.Linear(128, 10)
)
# Move model to device
model = model.to(current_device)
# Create random input data
batch_size = 64
input_data = neurenix.randn(batch_size, 784, device=current_device)
# Check memory before inference
total, free, used = neurenix.hardware.memory_info(current_device)
print(f"Memory before inference - Total: {total/1e9:.2f} GB, Free: {free/1e9:.2f} GB, Used: {used/1e9:.2f} GB")
# Warm-up
for _ in range(10):
_ = model(input_data)
# Benchmark inference
num_iterations = 100
start_time = time.time()
for _ in range(num_iterations):
_ = model(input_data)
# Synchronize to ensure all operations are complete
if current_device.startswith("cuda") or current_device.startswith("rocm"):
neurenix.hardware.synchronize()
end_time = time.time()
elapsed_time = end_time - start_time
print(f"Inference time for {num_iterations} iterations: {elapsed_time:.4f} seconds")
print(f"Average time per batch: {elapsed_time / num_iterations * 1000:.2f} ms")
# Check memory after inference
total, free, used = neurenix.hardware.memory_info(current_device)
print(f"Memory after inference - Total: {total/1e9:.2f} GB, Free: {free/1e9:.2f} GB, Used: {used/1e9:.2f} GB")
# Clean up
neurenix.hardware.empty_cache()
import neurenix
from neurenix.hardware import quantize, optimize
from neurenix.data import DataLoader
# Load a pre-trained model
model = neurenix.models.resnet50(pretrained=True)
original_model = model.clone() # Keep a copy of the original model
# Prepare a calibration dataset
calibration_dataset = neurenix.data.ImageFolder(
root="/path/to/calibration/data",
transform=neurenix.data.transforms.Compose([
neurenix.data.transforms.Resize(256),
neurenix.data.transforms.CenterCrop(224),
neurenix.data.transforms.ToTensor(),
neurenix.data.transforms.Normalize(
mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]
)
])
)
calibration_loader = DataLoader(
calibration_dataset,
batch_size=32,
shuffle=True,
num_workers=4
)
# Prepare evaluation dataset
eval_dataset = neurenix.data.ImageFolder(
root="/path/to/eval/data",
transform=neurenix.data.transforms.Compose([
neurenix.data.transforms.Resize(256),
neurenix.data.transforms.CenterCrop(224),
neurenix.data.transforms.ToTensor(),
neurenix.data.transforms.Normalize(
mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]
)
])
)
eval_loader = DataLoader(
eval_dataset,
batch_size=32,
shuffle=False,
num_workers=4
)
# Function to evaluate model
def evaluate_model(model, dataloader, device):
model.eval()
correct = 0
total = 0
with neurenix.no_grad():
for inputs, targets in dataloader:
inputs = inputs.to(device)
targets = targets.to(device)
outputs = model(inputs)
_, predicted = outputs.max(1)
total += targets.size(0)
correct += predicted.eq(targets).sum().item()
return correct / total * 100
# Set device
device = "cuda:0" if neurenix.hardware.is_available("cuda") else "cpu"
model = model.to(device)
original_model = original_model.to(device)
# Evaluate original model
print("Evaluating original model...")
original_accuracy = evaluate_model(original_model, eval_loader, device)
print(f"Original model accuracy: {original_accuracy:.2f}%")
# Quantize model to INT8
print("Quantizing model to INT8...")
quantized_model = neurenix.hardware.quantize(
model=model.clone(),
quantization_scheme="static",
dtype="int8",
calibration_dataset=calibration_loader,
per_channel=True,
symmetric=False
)
# Evaluate quantized model
print("Evaluating INT8 quantized model...")
quantized_accuracy = evaluate_model(quantized_model, eval_loader, device)
print(f"INT8 quantized model accuracy: {quantized_accuracy:.2f}%")
print(f"Accuracy difference: {original_accuracy - quantized_accuracy:.2f}%")
# Optimize model with operator fusion
print("Optimizing model with operator fusion...")
optimized_model = neurenix.hardware.optimize(
model=model.clone(),
target_device=device,
optimization_level=2,
fuse_operations=True,
enable_tensor_cores=True if device.startswith("cuda") else False
)
# Evaluate optimized model
print("Evaluating optimized model...")
optimized_accuracy = evaluate_model(optimized_model, eval_loader, device)
print(f"Optimized model accuracy: {optimized_accuracy:.2f}%")
print(f"Accuracy difference: {original_accuracy - optimized_accuracy:.2f}%")
# Combine quantization and optimization
print("Applying both quantization and optimization...")
quantized_optimized_model = neurenix.hardware.optimize(
model=quantized_model.clone(),
target_device=device,
optimization_level=2,
fuse_operations=True,
enable_tensor_cores=True if device.startswith("cuda") else False
)
# Evaluate combined model
print("Evaluating quantized and optimized model...")
combined_accuracy = evaluate_model(quantized_optimized_model, eval_loader, device)
print(f"Quantized and optimized model accuracy: {combined_accuracy:.2f}%")
print(f"Accuracy difference: {original_accuracy - combined_accuracy:.2f}%")
# Compare model sizes
original_size = neurenix.hardware.model_size(original_model)
quantized_size = neurenix.hardware.model_size(quantized_model)
optimized_size = neurenix.hardware.model_size(optimized_model)
combined_size = neurenix.hardware.model_size(quantized_optimized_model)
print(f"Original model size: {original_size/1e6:.2f} MB")
print(f"Quantized model size: {quantized_size/1e6:.2f} MB ({quantized_size/original_size*100:.2f}% of original)")
print(f"Optimized model size: {optimized_size/1e6:.2f} MB ({optimized_size/original_size*100:.2f}% of original)")
print(f"Quantized and optimized model size: {combined_size/1e6:.2f} MB ({combined_size/original_size*100:.2f}% of original)")
# Benchmark inference speed
def benchmark_inference(model, input_tensor, num_iterations=100):
model.eval()
# Warm-up
for _ in range(10):
_ = model(input_tensor)
# Synchronize before timing
if device.startswith("cuda"):
neurenix.hardware.synchronize()
start_time = time.time()
for _ in range(num_iterations):
_ = model(input_tensor)
# Synchronize after timing
if device.startswith("cuda"):
neurenix.hardware.synchronize()
end_time = time.time()
elapsed_time = end_time - start_time
return elapsed_time / num_iterations * 1000 # ms per inference
# Create a sample input tensor
input_tensor = neurenix.randn(1, 3, 224, 224, device=device)
# Benchmark all models
original_time = benchmark_inference(original_model, input_tensor)
quantized_time = benchmark_inference(quantized_model, input_tensor)
optimized_time = benchmark_inference(optimized_model, input_tensor)
combined_time = benchmark_inference(quantized_optimized_model, input_tensor)
print(f"Original model inference time: {original_time:.2f} ms per image")
print(f"Quantized model inference time: {quantized_time:.2f} ms per image ({original_time/quantized_time:.2f}x speedup)")
print(f"Optimized model inference time: {optimized_time:.2f} ms per image ({original_time/optimized_time:.2f}x speedup)")
print(f"Quantized and optimized model inference time: {combined_time:.2f} ms per image ({original_time/combined_time:.2f}x speedup)")
# Save the optimized model
neurenix.save(quantized_optimized_model, "quantized_optimized_resnet50.nrx")
print("Saved quantized and optimized model to quantized_optimized_resnet50.nrx")
import neurenix
import os
import argparse
from neurenix.hardware import distributed
from neurenix.data import DataLoader, DistributedSampler
# Parse arguments
parser = argparse.ArgumentParser(description='Distributed training example')
parser.add_argument('--nodes', default=1, type=int, help='number of nodes')
parser.add_argument('--gpus', default=1, type=int, help='number of gpus per node')
parser.add_argument('--node_rank', default=0, type=int, help='ranking within the nodes')
parser.add_argument('--master_addr', default='localhost', type=str, help='address of master node')
parser.add_argument('--master_port', default='12355', type=str, help='port of master node')
args = parser.parse_args()
# Calculate world_size and rank
args.world_size = args.gpus * args.nodes
os.environ['MASTER_ADDR'] = args.master_addr
os.environ['MASTER_PORT'] = args.master_port
# Define the training function
def train(gpu, args):
# Calculate rank
rank = args.node_rank * args.gpus + gpu
# Initialize distributed process group
distributed.init_process_group(
backend='nccl',
init_method='env://',
world_size=args.world_size,
rank=rank
)
# Create model
model = neurenix.models.resnet18()
# Move model to GPU
device = f'cuda:{gpu}'
model = model.to(device)
# Wrap model with DistributedDataParallel
model = distributed.DistributedDataParallel(
model,
device_ids=[gpu],
output_device=gpu
)
# Create dataset and sampler
train_dataset = neurenix.data.CIFAR10(
root='/path/to/data',
train=True,
download=True,
transform=neurenix.data.transforms.Compose([
neurenix.data.transforms.RandomCrop(32, padding=4),
neurenix.data.transforms.RandomHorizontalFlip(),
neurenix.data.transforms.ToTensor(),
neurenix.data.transforms.Normalize(
mean=[0.4914, 0.4822, 0.4465],
std=[0.2470, 0.2435, 0.2616]
)
])
)
train_sampler = DistributedSampler(
dataset=train_dataset,
num_replicas=args.world_size,
rank=rank
)
train_loader = DataLoader(
dataset=train_dataset,
batch_size=128,
shuffle=False, # Sampler handles shuffling
num_workers=4,
pin_memory=True,
sampler=train_sampler
)
# Define optimizer and loss function
optimizer = neurenix.optim.SGD(
model.parameters(),
lr=0.1,
momentum=0.9,
weight_decay=1e-4
)
criterion = neurenix.nn.CrossEntropyLoss()
# Training loop
for epoch in range(10):
train_sampler.set_epoch(epoch)
model.train()
for i, (images, target) in enumerate(train_loader):
images = images.to(device)
target = target.to(device)
# Forward pass
output = model(images)
loss = criterion(output, target)
# Backward pass and optimize
optimizer.zero_grad()
loss.backward()
optimizer.step()
if i % 100 == 0 and rank == 0:
print(f'Epoch: {epoch}, Batch: {i}, Loss: {loss.item()}')
# Save checkpoint (only on rank 0)
if rank == 0:
neurenix.save({
'epoch': epoch,
'model_state_dict': model.module.state_dict(),
'optimizer_state_dict': optimizer.state_dict(),
}, f'checkpoint_epoch_{epoch}.nrx')
# Clean up
distributed.destroy_process_group()
# Launch training processes
if __name__ == '__main__':
# Use multiprocessing to launch multiple processes on a single node
import neurenix.hardware.multiprocessing as mp
mp.spawn(train, nprocs=args.gpus, args=(args,))
The Hardware Acceleration module in Neurenix provides a comprehensive set of tools for leveraging various hardware accelerators to optimize the performance of machine learning workloads. Its unified device abstraction, support for multiple hardware backends, automatic optimization techniques, and distributed execution capabilities make it a powerful solution for accelerating machine learning applications across different hardware platforms.
Compared to other frameworks like TensorFlow, PyTorch, and Scikit-Learn, Neurenix Hardware offers advantages in terms of hardware support, optimization techniques, and ease of use. Its multi-language architecture ensures optimal performance while maintaining a user-friendly interface, making it suitable for both research and production environments.
By following the best practices and tutorials outlined in this documentation, users can leverage the full power of their hardware to accelerate machine learning workloads, from training large models on high-performance GPUs to deploying optimized models on edge devices.