This document describes the OpenCL support in the Phynexus framework.
OpenCL (Open Computing Language) is an open standard for cross-platform parallel programming of diverse processors, including CPUs, GPUs, and FPGAs. Phynexus includes OpenCL support to provide hardware acceleration across a wide range of devices and platforms, so the same code can be accelerated on any OpenCL-compatible device.
The OpenCL backend in Phynexus uses the OpenCL API as a vendor-neutral route to hardware acceleration, allowing models to run on devices from NVIDIA, AMD, Intel, and other manufacturers. This makes it well suited to deploying AI applications in heterogeneous computing environments.
The OpenCL backend in Phynexus provides:
OpenCL support in Phynexus requires:
Compatible hardware includes:

- NVIDIA GPUs with OpenCL support
- AMD GPUs and APUs
- Intel CPUs, GPUs, and FPGAs
- ARM Mali GPUs
- Qualcomm Adreno GPUs
- Various FPGA platforms
- Any other OpenCL-compatible device
# Python
from neurenix.hardware.opencl import is_opencl_available
if is_opencl_available():
    print("OpenCL is available")
else:
    print("OpenCL is not available")
# Python
from neurenix.hardware.opencl import OpenCLBackend
# Create the backend
try:
    opencl = OpenCLBackend()
    # Initialize the backend
    if opencl.initialize():
        print("OpenCL backend initialized successfully")
    else:
        print("Failed to initialize OpenCL backend")
except RuntimeError as e:
    print(f"OpenCL error: {e}")
# Python
from neurenix.hardware.opencl import OpenCLBackend
# Create and initialize the backend
opencl = OpenCLBackend()
opencl.initialize()
# Get the number of available devices
device_count = opencl.get_device_count()
print(f"Available OpenCL devices: {device_count}")
# Get information about a specific device
device_info = opencl.get_device_info(0) # First device
print(f"Device info: {device_info}")
# Python
import neurenix as nx
from neurenix.hardware.opencl import OpenCLBackend
# Create tensors
a = nx.Tensor([[1, 2], [3, 4]])
b = nx.Tensor([[5, 6], [7, 8]])
# Create and initialize the backend
opencl = OpenCLBackend()
opencl.initialize()
# Perform matrix multiplication using OpenCL
c = opencl.matmul(a, b)
print(f"Result: {c}")
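As a sanity check for the example above, the expected product can be computed without any backend. The sketch below is plain Python and does not use Neurenix; `matmul_reference` is a hypothetical helper introduced here only for illustration.

```python
def matmul_reference(a, b):
    """Multiply two matrices given as nested lists (row-major)."""
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


expected = matmul_reference([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(expected)  # [[19, 22], [43, 50]]
```

Comparing the backend's output against a reference like this is a cheap way to confirm that a device and driver produce correct results.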
# Python
import neurenix as nx
from neurenix.hardware.opencl import OpenCLBackend
# Create input and weight tensors
input = nx.random.randn(1, 3, 32, 32) # Batch, Channels, Height, Width
weight = nx.random.randn(16, 3, 3, 3) # Out channels, In channels, Kernel H, Kernel W
# Create and initialize the backend
opencl = OpenCLBackend()
opencl.initialize()
# Perform 2D convolution using OpenCL
output = opencl.conv2d(
    input=input,
    weight=weight,
    bias=None,
    stride=(1, 1),
    padding=(1, 1),
)
print(f"Output shape: {output.shape}")
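The output shape printed above can be predicted from the usual convolution arithmetic. The sketch below assumes the standard convention (no dilation, no groups, floor division); whether `opencl.conv2d` follows exactly this convention is an assumption, and `conv2d_output_shape` is a hypothetical helper, not part of the framework.

```python
def conv2d_output_shape(input_shape, weight_shape, stride=(1, 1), padding=(0, 0)):
    """Return (N, C_out, H_out, W_out) for a 2D convolution without dilation."""
    n, c_in, h, w = input_shape
    c_out, c_in_w, kh, kw = weight_shape
    if c_in != c_in_w:
        raise ValueError("input channels do not match weight channels")
    # Standard formula: out = floor((in + 2 * pad - kernel) / stride) + 1
    h_out = (h + 2 * padding[0] - kh) // stride[0] + 1
    w_out = (w + 2 * padding[1] - kw) // stride[1] + 1
    return (n, c_out, h_out, w_out)


shape = conv2d_output_shape((1, 3, 32, 32), (16, 3, 3, 3), stride=(1, 1), padding=(1, 1))
print(shape)  # (1, 16, 32, 32)
```

With a 3x3 kernel, stride 1 and padding 1, the spatial size is preserved, so only the channel count changes from 3 to 16.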
# Python
import neurenix as nx
import numpy as np
from neurenix.hardware.opencl import OpenCLBackend
# Create and initialize the backend
opencl = OpenCLBackend()
opencl.initialize()
# Define a custom OpenCL kernel
kernel_source = """
__kernel void vector_add(__global const float* a, __global const float* b, __global float* c) {
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}
"""
# In a real implementation, you would use the following:
# program = opencl._build_program(opencl._contexts[0], kernel_source)
# kernel = program.create_kernel("vector_add")
#
# # Create buffers
# a_np = np.array([1, 2, 3, 4], dtype=np.float32)
# b_np = np.array([5, 6, 7, 8], dtype=np.float32)
# c_np = np.zeros(4, dtype=np.float32)
#
# # Execute the kernel
# opencl._execute_kernel(kernel, (4,), None, [a_np, b_np, c_np])
#
# print(f"Result: {c_np}")
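To make the execution model of the `vector_add` kernel concrete, the following sketch emulates a one-dimensional launch in plain Python: the kernel body runs once per global ID. This is only an illustration of the semantics; `launch_1d` is a hypothetical helper, and a real device would run the work-items in parallel rather than in a loop.

```python
def vector_add(gid, a, b, c):
    """Body of one work-item; mirrors the OpenCL kernel above."""
    c[gid] = a[gid] + b[gid]


def launch_1d(kernel, global_size, *args):
    """Emulate a 1D launch by calling the kernel once per global ID.

    The loop is sequential; it only reproduces the result, not the parallelism.
    """
    for gid in range(global_size):
        kernel(gid, *args)


a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]
c = [0.0] * 4
launch_1d(vector_add, len(a), a, b, c)
print(c)  # [6.0, 8.0, 10.0, 12.0]
```

The global size of 4 corresponds to the `(4,)` passed to `_execute_kernel` in the commented-out code above.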
The OpenCL backend implementation in Phynexus follows a layered architecture:
The implementation uses the OpenCL API to:
The OpenCL backend automatically discovers available platforms and devices:
The OpenCL backend includes automatic fallback to CPU implementations when:

- OpenCL is not available on the system
- The operation is not supported by OpenCL
- An error occurs during OpenCL execution
This ensures that code written to use OpenCL will still work on systems without OpenCL support, albeit with reduced performance.
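The fallback behaviour described above can be sketched as a small wrapper. This is a minimal illustration of the pattern, not the framework's actual mechanism: `run_with_fallback`, `cpu_add` and `broken_opencl_add` are hypothetical names, and the mapping of the three cases to `None`, `NotImplementedError` and `RuntimeError` is an assumption made for the example.

```python
import logging

logger = logging.getLogger(__name__)


def run_with_fallback(accelerated, cpu_reference, *args, **kwargs):
    """Try the accelerated path; fall back to the CPU path on failure.

    Returns a (result, used_fallback) tuple.
    """
    if accelerated is None:  # case 1: OpenCL not available
        return cpu_reference(*args, **kwargs), True
    try:
        return accelerated(*args, **kwargs), False
    except (NotImplementedError, RuntimeError) as exc:
        # case 2: unsupported operation; case 3: error during execution
        logger.warning("accelerated path failed (%s); using CPU", exc)
        return cpu_reference(*args, **kwargs), True


def cpu_add(x, y):
    """CPU reference implementation of element-wise addition."""
    return [p + q for p, q in zip(x, y)]


def broken_opencl_add(x, y):
    """Stand-in for an accelerated call that fails at run time."""
    raise RuntimeError("simulated OpenCL failure")


result, used_fallback = run_with_fallback(broken_opencl_add, cpu_add, [1, 2], [3, 4])
print(result, used_fallback)  # [4, 6] True
```

Returning a flag alongside the result makes it possible to log or test which path actually ran, which is useful when diagnosing unexpectedly slow runs.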
When multiple OpenCL-compatible devices are available:
# Python
from neurenix.hardware.opencl import OpenCLBackend
# Create the backend
opencl = OpenCLBackend()
opencl.initialize()
# Get the number of available devices
device_count = opencl.get_device_count()
# Get information about all devices
for i in range(device_count):
    device_info = opencl.get_device_info(i)
    print(f"Device {i}: {device_info}")
# Select the best device based on your requirements
# (In a real implementation, you would choose based on device capabilities)
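One way to act on the device information is to rank devices by a few capabilities. The sketch below assumes each device is described by a dictionary with `type`, `compute_units` and `global_mem_bytes` keys; these keys are hypothetical, since the structure returned by `get_device_info` is not specified here, and `pick_device` is an illustrative helper rather than part of the API.

```python
def pick_device(devices, preferred_types=("GPU", "ACCELERATOR", "CPU")):
    """Return the index of the preferred device.

    Devices are ranked by type first, then by compute units, then by memory.
    """
    if not devices:
        raise ValueError("no devices to choose from")

    def rank(item):
        _, info = item
        dev_type = info.get("type", "CPU")
        if dev_type in preferred_types:
            type_rank = preferred_types.index(dev_type)
        else:
            type_rank = len(preferred_types)  # unknown types sort last
        # Negate sizes so that larger values win under min().
        return (type_rank,
                -info.get("compute_units", 0),
                -info.get("global_mem_bytes", 0))

    best_index, _ = min(enumerate(devices), key=rank)
    return best_index


devices = [
    {"type": "CPU", "compute_units": 16, "global_mem_bytes": 32 * 2**30},
    {"type": "GPU", "compute_units": 36, "global_mem_bytes": 8 * 2**30},
    {"type": "GPU", "compute_units": 80, "global_mem_bytes": 16 * 2**30},
]
print(pick_device(devices))  # 2
```

The ranking criteria are a design choice; a memory-bound workload might reasonably put `global_mem_bytes` ahead of `compute_units`.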
For optimal performance with OpenCL:
For best performance with custom OpenCL kernels:
# Example of an optimized kernel (conceptual, not actual implementation)
optimized_kernel = """
// Tile width. The value 16 is an assumption for illustration; it must match
// the work-group size (BLOCK_SIZE x BLOCK_SIZE), and M, N, K are assumed to
// be multiples of BLOCK_SIZE.
#define BLOCK_SIZE 16

__kernel void optimized_matmul(
    __global const float* a,
    __global const float* b,
    __global float* c,
    const int M, const int N, const int K
) {
    // Use local memory for frequently accessed data
    __local float a_local[BLOCK_SIZE][BLOCK_SIZE];
    __local float b_local[BLOCK_SIZE][BLOCK_SIZE];

    // Get global and local IDs
    int row = get_global_id(0);
    int col = get_global_id(1);
    int local_row = get_local_id(0);
    int local_col = get_local_id(1);

    // Initialize accumulator
    float acc = 0.0f;

    // Loop over blocks
    for (int block = 0; block < K / BLOCK_SIZE; ++block) {
        // Load data into local memory
        a_local[local_row][local_col] = a[row * K + block * BLOCK_SIZE + local_col];
        b_local[local_row][local_col] = b[(block * BLOCK_SIZE + local_row) * N + col];

        // Synchronize to ensure all work-items have loaded data
        barrier(CLK_LOCAL_MEM_FENCE);

        // Compute partial dot product
        for (int i = 0; i < BLOCK_SIZE; ++i) {
            acc += a_local[local_row][i] * b_local[i][local_col];
        }

        // Synchronize before loading next block
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Store result
    c[row * N + col] = acc;
}
"""
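The tiling scheme in the conceptual kernel above can be traced on the CPU. The sketch below emulates it in plain Python: each pair of outer loop indices plays the role of one work-group, and the two small lists stand in for local memory. It is an illustration of the blocking idea only, assuming dimensions that are multiples of `BLOCK_SIZE`; `tiled_matmul` and `naive_matmul` are hypothetical helpers.

```python
BLOCK_SIZE = 2  # tile width; kept small so the example is easy to trace


def tiled_matmul(a, b, m, n, k):
    """Multiply row-major flat lists a (m x k) and b (k x n) tile by tile."""
    if m % BLOCK_SIZE or n % BLOCK_SIZE or k % BLOCK_SIZE:
        raise ValueError("dimensions must be multiples of BLOCK_SIZE")
    c = [0.0] * (m * n)
    # Each (group_row, group_col) pair plays the role of one work-group.
    for group_row in range(0, m, BLOCK_SIZE):
        for group_col in range(0, n, BLOCK_SIZE):
            acc = [[0.0] * BLOCK_SIZE for _ in range(BLOCK_SIZE)]
            for block in range(k // BLOCK_SIZE):
                # "Local memory": one tile of a and one tile of b.
                a_local = [[a[(group_row + r) * k + block * BLOCK_SIZE + col]
                            for col in range(BLOCK_SIZE)]
                           for r in range(BLOCK_SIZE)]
                b_local = [[b[(block * BLOCK_SIZE + r) * n + group_col + col]
                            for col in range(BLOCK_SIZE)]
                           for r in range(BLOCK_SIZE)]
                # Every emulated work-item adds its partial dot product.
                for r in range(BLOCK_SIZE):
                    for col in range(BLOCK_SIZE):
                        acc[r][col] += sum(a_local[r][i] * b_local[i][col]
                                           for i in range(BLOCK_SIZE))
            for r in range(BLOCK_SIZE):
                for col in range(BLOCK_SIZE):
                    c[(group_row + r) * n + group_col + col] = acc[r][col]
    return c


def naive_matmul(a, b, m, n, k):
    """Reference triple-loop product on the same flat layout."""
    return [sum(a[i * k + p] * b[p * n + j] for p in range(k))
            for i in range(m) for j in range(n)]


m, n, k = 4, 4, 4
a = [float(i) for i in range(m * k)]
b = [float(i % 5) for i in range(k * n)]
assert tiled_matmul(a, b, m, n, k) == naive_matmul(a, b, m, n, k)
```

The benefit on a real device comes from each tile being read from global memory once and reused `BLOCK_SIZE` times from local memory; the Python version reproduces only the arithmetic, not the speedup.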
| Feature | Neurenix | TensorFlow |
|---|---|---|
| OpenCL Support | Native integration | Limited support via third-party extensions |
| Cross-Platform | Windows, Linux, macOS | Primarily focused on CUDA |
| Vendor Support | Multiple vendors (NVIDIA, AMD, Intel) | Primarily NVIDIA |
| Custom Kernels | Unified API for custom kernels | Complex integration for custom kernels |
| Fallback Mechanism | Automatic CPU fallback | Manual fallback required |
| API Simplicity | Unified API | Separate API for different backends |
Neurenix provides more comprehensive OpenCL support than TensorFlow, which primarily focuses on CUDA for GPU acceleration. The unified API in Neurenix makes it easier to use OpenCL acceleration while maintaining compatibility with other backends.
| Feature | Neurenix | PyTorch |
|---|---|---|
| OpenCL Support | Native integration | Limited support via third-party extensions |
| Cross-Platform | Windows, Linux, macOS | Primarily focused on CUDA |
| Vendor Support | Multiple vendors (NVIDIA, AMD, Intel) | Primarily NVIDIA |
| Custom Kernels | Unified API for custom kernels | Complex integration for custom kernels |
| Fallback Mechanism | Automatic CPU fallback | Manual fallback required |
| API Simplicity | Unified API | Separate API for different backends |
PyTorch, like TensorFlow, primarily focuses on CUDA for GPU acceleration, with limited OpenCL support through third-party extensions. Neurenix's native OpenCL support provides a more integrated experience with automatic fallback to CPU when needed.
| Feature | Neurenix | Scikit-Learn |
|---|---|---|
| OpenCL Support | Comprehensive support | No OpenCL support |
| Hardware Acceleration | Multiple acceleration options | CPU only |
| Cross-Platform GPU | Support across platforms | No GPU support |
| Vendor Support | Multiple vendors (NVIDIA, AMD, Intel) | No GPU support |
| Custom Kernels | Support for custom kernels | No kernel support |
| Performance | Optimized for various hardware | Optimized for CPU only |
Scikit-Learn does not provide any OpenCL or GPU acceleration support, focusing solely on CPU execution. Neurenix's OpenCL support can improve performance on systems with compatible devices.
The current OpenCL implementation has the following limitations:
Future development of the OpenCL backend will include: