Token streaming is the mode in which the server returns tokens one by one as the model generates them. Streaming brings several benefits:
- Users get partial results earlier for long generations
- Users can stop generation if the response isn’t going in the direction they wanted
- More natural, conversational experience
- Reduces perceived latency: end-to-end latency is the same, but halfway through generation the user has already seen half of the output instead of nothing at all
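To make the mechanics concrete, here is a minimal Python sketch contrasting a blocking response with a streaming one. The `fake_generate` function and the hard-coded token list are hypothetical stand-ins for a real model; the point is that the streaming path yields each token as soon as it is produced instead of buffering the full result.

```python
import time
from typing import Iterator

TOKENS = ["Token", " streaming", " returns", " results", " incrementally", "."]

def fake_generate() -> Iterator[str]:
    """Stand-in for a model: emits one token at a time with per-token compute cost."""
    for token in TOKENS:
        time.sleep(0.3)  # simulated generation time for one token
        yield token

def blocking_response() -> str:
    """Non-streaming: the user sees nothing until every token is ready."""
    return "".join(fake_generate())

def streaming_response() -> None:
    """Streaming: each token is shown the moment the model produces it."""
    for token in fake_generate():
        print(token, end="", flush=True)  # partial output is visible immediately
    print()

if __name__ == "__main__":
    streaming_response()        # tokens appear one by one
    print(blocking_response())  # appears all at once, after the same total time
```

In a real deployment the streaming path is typically exposed over Server-Sent Events or WebSockets so the client can render tokens as they arrive; the total generation time is identical in both paths, only the time to first visible token changes.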