Token streaming is the mode in which the server returns tokens one by one as the model generates them, rather than waiting for the full response to complete (see the sketch after the list below).

  • Users start seeing results sooner on long responses
  • Users can stop generation if the response isn’t going in the direction they wanted
  • More natural, conversational experience
  • Reduces perceived latency
    • End-to-end latency is the same, but halfway through generation the user will have seen half of the result instead of nothing at all.
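Here is a minimal client-side sketch of consuming a token stream over server-sent events (SSE). The endpoint URL, request payload, and response shape are assumptions for illustration; adapt them to your server's actual API.

```python
import json

import requests

def stream_tokens(prompt: str):
    """Yield tokens as the server emits them, instead of waiting for the full response."""
    response = requests.post(
        "http://localhost:8000/generate",  # hypothetical streaming endpoint
        json={"prompt": prompt, "stream": True},  # assumed request schema
        stream=True,  # keep the connection open and read the body incrementally
    )
    for line in response.iter_lines():
        if not line:
            continue  # SSE separates events with blank lines
        payload = line.decode("utf-8").removeprefix("data: ")
        if payload == "[DONE]":  # common sentinel marking end of stream
            break
        yield json.loads(payload)["token"]  # assumed response shape

# Print tokens as they arrive; the user can interrupt (Ctrl+C) mid-generation.
for token in stream_tokens("Explain token streaming."):
    print(token, end="", flush=True)
```

Because the generator yields each token as soon as it arrives, the caller can render partial output immediately or abandon the stream early, which is exactly what enables the benefits listed above.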