This document describes the NVIDIA TensorRT support in the Phynexus framework.
NVIDIA TensorRT is a high-performance deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications. Phynexus integrates with TensorRT to provide accelerated inference on NVIDIA GPUs, enabling significant performance improvements for deployed models.
The TensorRT backend in Phynexus uses NVIDIA's runtime to apply optimizations such as layer fusion, precision calibration, kernel auto-tuning, and dynamic tensor memory management. These optimizations let models run efficiently on NVIDIA GPUs, which makes the backend well suited to production deployment of deep learning models.
The TensorRT backend in Phynexus provides:
TensorRT support in Phynexus requires:
Compatible hardware includes:
- NVIDIA GeForce RTX series
- NVIDIA Tesla series
- NVIDIA Quadro series
- NVIDIA A100, A10, A30, A40
- NVIDIA T4, V100, P100
# Python
from neurenix.hardware.tensorrt import is_tensorrt_available
if is_tensorrt_available():
    print("TensorRT is available")
else:
    print("TensorRT is not available")
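In practice the availability check is often paired with a fallback, so the same script still runs on machines without TensorRT. The following is a minimal sketch under stated assumptions: the module path and `is_tensorrt_available` come from the example above, while the helper name `pick_inference_backend` and the `"cpu"` label are hypothetical and not part of the Neurenix API.

```python
import importlib


def pick_inference_backend() -> str:
    """Return "tensorrt" when the backend can be used, otherwise "cpu".

    The "cpu" label is a hypothetical placeholder for whatever fallback
    the application uses; it is not a documented Neurenix identifier.
    """
    try:
        # Module path taken from the availability example above.
        module = importlib.import_module("neurenix.hardware.tensorrt")
    except ImportError:
        # Neurenix (or its TensorRT module) is not installed on this machine.
        return "cpu"

    is_available = getattr(module, "is_tensorrt_available", None)
    if callable(is_available) and is_available():
        return "tensorrt"
    return "cpu"


backend_name = pick_inference_backend()
print(f"Selected inference backend: {backend_name}")
```

Guarding the import as well as the availability call means a missing package and a missing GPU are handled the same way.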
# Python
from neurenix.hardware.tensorrt import TensorRTBackend
# Create the backend
try:
    tensorrt = TensorRTBackend()
    # Initialize the backend
    if tensorrt.initialize():
        print("TensorRT backend initialized successfully")
    else:
        print("Failed to initialize TensorRT backend")
except RuntimeError as e:
    print(f"TensorRT error: {e}")
# Python
from neurenix.hardware.tensorrt import TensorRTBackend
# Create and initialize the backend
tensorrt = TensorRTBackend()
tensorrt.initialize()
# Get the number of available devices
device_count = tensorrt.get_device_count()
print(f"Available TensorRT devices: {device_count}")
# Get information about a specific device
device_info = tensorrt.get_device_info(0) # First device
print(f"Device info: {device_info}")
# Python
import neurenix as nx
from neurenix.hardware.tensorrt import TensorRTBackend
from neurenix.nn import Sequential, Conv2d, ReLU, Linear
# Create a model
model = Sequential(
    Conv2d(3, 16, kernel_size=3, stride=1, padding=1),
    ReLU(),
    Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
    ReLU(),
    Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
    ReLU(),
    Linear(64 * 8 * 8, 10)
)
# Create and initialize the backend
tensorrt = TensorRTBackend()
tensorrt.initialize()
# Define input shapes
input_shapes = {"input": (1, 3, 32, 32)}
# Optimize the model with TensorRT
optimized_model = tensorrt.optimize_model(
    model=model,
    input_shapes=input_shapes,
    precision="fp16",
    workspace_size=1 << 30  # 1 GiB
)
print("Model optimized with TensorRT")
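The `Linear(64 * 8 * 8, 10)` layer above depends on the spatial size left after the three convolutions, and that size must agree with the shape declared in `input_shapes`. The sketch below checks the arithmetic with the standard convolution output formula. It assumes Neurenix's `Conv2d` follows that formula with no dilation, which this document does not state; the helper name `conv2d_output_size` is hypothetical.

```python
def conv2d_output_size(size: int, kernel_size: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution without dilation."""
    return (size + 2 * padding - kernel_size) // stride + 1


# (out_channels, kernel_size, stride, padding) for the three Conv2d layers above.
conv_layers = [(16, 3, 1, 1), (32, 3, 2, 1), (64, 3, 2, 1)]

height = width = 32  # from input_shapes = {"input": (1, 3, 32, 32)}
channels = 3
for out_channels, kernel_size, stride, padding in conv_layers:
    height = conv2d_output_size(height, kernel_size, stride, padding)
    width = conv2d_output_size(width, kernel_size, stride, padding)
    channels = out_channels

flattened_features = channels * height * width
print(f"Feature map: {channels} x {height} x {width}")
print(f"Flattened features: {flattened_features}")
```

Under these assumptions the feature map is 64 x 8 x 8, which matches the `Linear` input size. Whether `Sequential` flattens the 4-D tensor before `Linear` is not stated here, so an explicit flatten step may be needed.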
# Python
import neurenix as nx
from neurenix.hardware.tensorrt import TensorRTBackend
# Create input tensor
input_tensor = nx.random.randn(1, 3, 32, 32)
# Create and initialize the backend
tensorrt = TensorRTBackend()
tensorrt.initialize()
# Run inference using TensorRT
# (`optimized_model` is the result of optimize_model() in the previous example)
outputs = tensorrt.inference(
    model=optimized_model,
    inputs={"input": input_tensor}
)
print(f"Output shape: {outputs['output'].shape}")
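To judge whether the optimized model is actually faster, it helps to measure latency over many runs rather than a single call. The helper below is a generic sketch, not part of Neurenix: `measure_latency` is a hypothetical name, and it times any zero-argument callable. A stand-in workload is used so the snippet runs without a GPU.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(
    run_once: Callable[[], object],
    warmup: int = 5,
    iterations: int = 50,
) -> Dict[str, float]:
    """Time a zero-argument callable and return latency statistics in milliseconds."""
    for _ in range(warmup):
        run_once()  # warm-up runs are excluded from the statistics

    samples_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_once()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    samples_ms.sort()
    return {
        "mean_ms": statistics.fmean(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[int(0.95 * (len(samples_ms) - 1))],
    }


# Stand-in workload so the sketch runs anywhere. With the backend available, pass
# lambda: tensorrt.inference(model=optimized_model, inputs={"input": input_tensor})
stats = measure_latency(lambda: sum(i * i for i in range(10_000)))
print(stats)
```

Wall-clock timing like this is only meaningful if the timed call blocks until the result is ready. This document does not say whether `inference` is synchronous, so treat that as an assumption to verify.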
The TensorRT backend implementation in Phynexus follows a layered architecture:
The implementation uses the TensorRT API to:
The TensorRT backend manages the following components:
The implementation supports multiple precision modes:
Choose the appropriate precision for your workload:
# Python
from neurenix.hardware.tensorrt import TensorRTBackend
# Create and initialize the backend
tensorrt = TensorRTBackend()
tensorrt.initialize()
# For a good balance of performance and accuracy
optimized_model = tensorrt.optimize_model(
    model=model,
    input_shapes=input_shapes,
    precision="fp16"  # FP16: faster than FP32, usually with little accuracy loss
)
# For maximum performance where accuracy is less critical
optimized_model = tensorrt.optimize_model(
    model=model,
    input_shapes=input_shapes,
    precision="int8"  # INT8: maximum performance; validate accuracy after conversion
)
# For maximum accuracy
optimized_model = tensorrt.optimize_model(
    model=model,
    input_shapes=input_shapes,
    precision="fp32"  # FP32: maximum accuracy
)
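The accuracy trade-off between the three precision modes can be made concrete with a few numbers. The sketch below uses only the Python standard library to round values to half precision and to apply a simple symmetric INT8 quantization with a single per-tensor scale. It illustrates the numeric formats only; it is not how TensorRT or Phynexus selects INT8 scales, and the helper names and sample values are made up for the example.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def quantize_int8(values, scale):
    """Symmetric INT8 quantization: round to the nearest step, clamp to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


weights = [0.1, -0.7531, 0.003, 1.25, -2.0]  # illustrative values only

fp16_values = [to_fp16(w) for w in weights]
fp16_error = max(abs(w - h) for w, h in zip(weights, fp16_values))

scale = max(abs(w) for w in weights) / 127  # one scale for the whole tensor
int8_codes = quantize_int8(weights, scale)
int8_values = [q * scale for q in int8_codes]
int8_error = max(abs(w - d) for w, d in zip(weights, int8_values))

print(f"max FP16 rounding error: {fp16_error:.6f}")
print(f"max INT8 rounding error: {int8_error:.6f} (step size {scale:.6f})")
```

For these values the INT8 error is bounded by half the step size and is noticeably larger than the FP16 error, which is why INT8 results should be validated against an FP32 baseline.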
Allocate sufficient workspace for TensorRT:
# Python
from neurenix.hardware.tensorrt import TensorRTBackend
# Create and initialize the backend
tensorrt = TensorRTBackend()
tensorrt.initialize()
# Allocate a large workspace for better optimization
optimized_model = tensorrt.optimize_model(
    model=model,
    input_shapes=input_shapes,
    workspace_size=2 << 30  # 2 GiB
)
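The `workspace_size` values in these examples are byte counts written as bit shifts. A small helper makes the unit explicit; `gib` is a hypothetical name introduced here for illustration, not a Neurenix function.

```python
def gib(count: int) -> int:
    """Return the number of bytes in `count` gibibytes (1 GiB = 2**30 bytes)."""
    return count << 30


assert gib(1) == 1 << 30 == 1024 ** 3      # value used in the first optimization example
assert gib(2) == 2 << 30 == 2 * 1024 ** 3  # value used in the workspace example above

for size in (gib(1), gib(2), gib(4)):
    print(f"{size:>13,d} bytes = {size // 2 ** 30} GiB")
```

Writing `workspace_size=gib(2)` reads more clearly than `2 << 30` and avoids mistakes such as `2 << 20`, which is only 2 MiB.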
For best performance, use fixed input shapes:
# Python
from neurenix.hardware.tensorrt import TensorRTBackend
# Create and initialize the backend
tensorrt = TensorRTBackend()
tensorrt.initialize()
# Specify exact input shapes for optimal performance
input_shapes = {"input": (1, 3, 224, 224)} # Fixed batch size and dimensions
optimized_model = tensorrt.optimize_model(
    model=model,
    input_shapes=input_shapes
)
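When an application must serve a small number of distinct input shapes, one way to keep the benefit of fixed shapes is to optimize once per shape and reuse the result. The sketch below shows that caching pattern with a stand-in build function so it runs without a GPU. `EngineCache` and `fake_build` are hypothetical names; in a real setup the build function would presumably wrap `tensorrt.optimize_model`.

```python
from typing import Callable, Dict, Tuple

Shape = Tuple[int, ...]


class EngineCache:
    """Build one optimized model per distinct input shape and reuse it afterwards."""

    def __init__(self, build: Callable[[Shape], object]) -> None:
        self._build = build
        self._engines: Dict[Shape, object] = {}
        self.build_count = 0

    def get(self, shape: Shape) -> object:
        if shape not in self._engines:
            self._engines[shape] = self._build(shape)  # the expensive step
            self.build_count += 1
        return self._engines[shape]


def fake_build(shape: Shape) -> str:
    # Stand-in for tensorrt.optimize_model(model=model, input_shapes={"input": shape}).
    return f"engine{shape}"


cache = EngineCache(fake_build)
requests = [(1, 3, 224, 224), (1, 3, 224, 224), (8, 3, 224, 224), (1, 3, 224, 224)]
for requested in requests:
    cache.get(requested)

print(f"Requests: {len(requests)}, builds: {cache.build_count}")
```

Four requests trigger only two builds here, because only two distinct shapes occur. Each cached model holds its own resources, so this pattern suits a handful of shapes rather than arbitrary ones.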
| Feature | Neurenix | TensorFlow |
|---|---|---|
| TensorRT Integration | Native integration | Via TF-TRT plugin |
| Optimization Process | Seamless | Requires manual configuration |
| Precision Control | Simple API | Complex configuration |
| Model Compatibility | High compatibility | Limited compatibility |
| API Simplicity | Unified API | Separate API for TensorRT |
| Integration with Framework | Fully integrated | Plugin-based integration |
Neurenix integrates TensorRT more directly than TensorFlow, where TensorRT is reached through TF-TRT, a converter module with its own API and configuration. The unified API in Neurenix makes it easier to optimize models with TensorRT while maintaining compatibility with other backends.
| Feature | Neurenix | PyTorch |
|---|---|---|
| TensorRT Integration | Native integration | Via torch-tensorrt |
| Optimization Process | Seamless | Requires manual configuration |
| Precision Control | Simple API | Complex configuration |
| Model Compatibility | High compatibility | Limited compatibility |
| API Simplicity | Unified API | Separate API for TensorRT |
| Integration with Framework | Fully integrated | Plugin-based integration |
PyTorch requires the torch-tensorrt package for TensorRT integration, which introduces a separate API and workflow. Neurenix's native TensorRT support provides a more integrated experience with simpler model optimization and inference.
| Feature | Neurenix | Scikit-Learn |
|---|---|---|
| TensorRT Support | Comprehensive support | No TensorRT support |
| GPU Acceleration | Native GPU support | Limited GPU support |
| Model Optimization | Automatic optimization | No deep learning optimization |
| Inference Performance | Optimized for inference | Not optimized for deep learning |
| Precision Control | Multiple precision options | No precision control |
| Deep Learning Support | Comprehensive | Limited |
Scikit-Learn does not provide any TensorRT integration, focusing on traditional machine learning algorithms rather than deep learning. Neurenix's TensorRT support enables significant performance improvements for deep learning inference on NVIDIA GPUs.
The current TensorRT implementation has the following limitations:
Future development of the TensorRT backend will include: