A technique for fitting a large model across multiple GPUs.

Each GPU holds and processes one slice of a tensor, and the full tensor is aggregated only for operations that require it.
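
As a rough illustration of the slicing idea, the sketch below simulates two devices in plain Python, with no real GPUs or communication library involved. It assumes a column-wise split of a weight matrix, which is one common layout but not the only one. The helper names (`split_columns`, `all_gather_columns`) are hypothetical and chosen only for this example.

```python
# Simulated tensor parallelism: each "device" is just an entry in a list.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(w, num_devices):
    """Split matrix w column-wise into num_devices equal shards (hypothetical helper)."""
    cols = len(w[0])
    assert cols % num_devices == 0, "columns must divide evenly across devices"
    step = cols // num_devices
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(num_devices)]

def all_gather_columns(shards):
    """Concatenate per-device output slices back into the full matrix (hypothetical helper)."""
    return [sum((shard[r] for shard in shards), []) for r in range(len(shards[0]))]

x = [[1, 2], [3, 4]]                # activations, replicated on every device
w = [[1, 0, 2, 1], [0, 1, 1, 3]]    # full weight matrix (2 x 4)

# Each device stores only its 2 x 2 slice of the weights.
w_shards = split_columns(w, num_devices=2)

# Local compute: every device multiplies by its own slice, with no communication.
partial_outputs = [matmul(x, w_k) for w_k in w_shards]

# Aggregate the full tensor only when a later operation needs it.
full_output = all_gather_columns(partial_outputs)

# The gathered result matches the unsharded computation.
assert full_output == matmul(x, w)
```

In a real setup the gather step would be a collective operation between GPUs, and it is typically deferred as long as the following operations can work on slices alone.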