Token streaming is the mode in which the server returns tokens one by one as the model generates them, rather than waiting for the full response to complete (see the sketch after the list below).

  • Users start seeing results sooner on long responses
  • Users can stop generation if the response isn’t going in the direction they wanted
  • More natural, conversational experience
  • Reduces perceived latency
    • End-to-end latency is the same, but halfway through generation the user will have seen half of the result instead of nothing at all.
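Here is a minimal client-side sketch of consuming a token stream over server-sent events (SSE). The endpoint URL, request payload, and response shape are assumptions for illustration; adapt them to your server's actual API.

```python
import json

import requests

def stream_tokens(prompt: str):
    """Yield tokens as the server emits them, instead of waiting for the full response."""
    response = requests.post(
        "http://localhost:8000/generate",  # hypothetical streaming endpoint
        json={"prompt": prompt, "stream": True},  # assumed request schema
        stream=True,  # keep the connection open and read the body incrementally
    )
    for line in response.iter_lines():
        if not line:
            continue  # SSE separates events with blank lines
        payload = line.decode("utf-8").removeprefix("data: ")
        if payload == "[DONE]":  # common sentinel marking end of stream
            break
        yield json.loads(payload)["token"]  # assumed response shape

# Print tokens as they arrive; the user can interrupt (Ctrl+C) mid-generation.
for token in stream_tokens("Explain token streaming."):
    print(token, end="", flush=True)
```

Because the generator yields each token as soon as it arrives, the caller can render partial output immediately or abandon the stream early, which is exactly what enables the benefits listed above.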