# Ray Data Integration Guide
Integration with Ray Train and ML frameworks.
## Ray Train integration
### Basic training with datasets
```python
import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
# Create datasets
train_ds = ray.data.read_parquet("s3://data/train/")
val_ds = ray.data.read_parquet("s3://data/val/")
def train_func(config):
# Get dataset shards
train_ds = ray.train.get_dataset_shard("train")
val_ds = ray.train.get_dataset_shard("val")
for epoch in range(config["epochs"]):
# Iterate over batches
for batch in train_ds.iter_batches(batch_size=32):
# Train on batch
pass
# Launch training
trainer = TorchTrainer(
train_func,
train_loop_config={"epochs": 10},
datasets={"train": train_ds, "val": val_ds},
scaling_config=ScalingConfig(num_workers=4, use_gpu=True)
)
result = trainer.fit()
```
## PyTorch integration
### Convert to PyTorch Dataset
```python
# Option 1: to_torch (recommended)
torch_ds = ds.to_torch(
label_column="label",
batch_size=32,
drop_last=True
)
for batch in torch_ds:
inputs = batch["features"]
labels = batch["label"]
# Train model
# Option 2: iter_torch_batches
for batch in ds.iter_torch_batches(batch_size=32):
# batch is dict of tensors
pass
```
## TensorFlow integration
```python
tf_ds = ds.to_tf(
feature_columns=["image", "text"],
label_column="label",
batch_size=32
)
for features, labels in tf_ds:
# Train TensorFlow model
pass
```
## Best practices
1. **Shard datasets in Ray Train** - Automatic with `get_dataset_shard()`
2. **Use streaming** - Don't load entire dataset to memory
3. **Preprocess in Ray Data** - Distribute preprocessing across cluster
4. **Cache preprocessed data** - Write to Parquet, read in training
Source: claude-code-templates (MIT). See About Us for full credits.