reduced overhead and non-blocking transfers for collated and masterOnly file handling
The aim of these changes is to avoid duplicate copies of character data and to improve communication throughput.
The output stream buffering is now based on OCharStream instead of OStringStream. This allows full recovery of the streamed characters without additional copies. The character data are "yielded" from the streaming buffer and passed on to the backend writers without an intermediate copy into a string and back out again. The full buffer, including unused portions, is transferred to avoid triggering any alloc/free at that point. The char data can then be communicated directly (non-blocking) to the output.
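A minimal sketch of the "yield" idea, using only the standard library. CharBuf, data() and release() are hypothetical names chosen for illustration and are not the OCharStream interface. It assumes the buffer owns a std::vector<char> and hands that storage over by move, so the characters are never copied and the unused tail travels along with them.

```cpp
#include <cstddef>
#include <ostream>
#include <streambuf>
#include <utility>
#include <vector>

// Hypothetical stand-in for a character output buffer (not the OpenFOAM API)
class CharBuf : public std::streambuf
{
    std::vector<char> storage_;

    // Point the put area at storage_, keeping the first 'used' characters
    void resetPutArea(std::size_t used)
    {
        char* base = storage_.data();
        setp(base, base + storage_.size());
        pbump(static_cast<int>(used));
    }

protected:

    // Called when the put area is full: grow the storage and continue
    int_type overflow(int_type ch) override
    {
        const std::size_t used = size();
        storage_.resize(storage_.empty() ? 64 : 2*storage_.size());
        resetPutArea(used);

        if (!traits_type::eq_int_type(ch, traits_type::eof()))
        {
            *pptr() = traits_type::to_char_type(ch);
            pbump(1);
        }
        return traits_type::not_eof(ch);
    }

public:

    // Start of the buffered characters (nullptr if nothing was written)
    const char* data() const { return pbase(); }

    // Number of characters written so far
    std::size_t size() const
    {
        return pptr() ? static_cast<std::size_t>(pptr() - pbase()) : 0;
    }

    // Yield the full storage, including the unused tail, by move.
    // The number of valid characters is returned in 'used'.
    std::vector<char> release(std::size_t& used)
    {
        used = size();
        std::vector<char> out(std::move(storage_));
        storage_.clear();
        setp(nullptr, nullptr);
        return out;
    }
};
```

After release() the returned vector refers to the same memory the stream wrote into, and only the first 'used' characters are meaningful. The buffer itself is empty and can be written to again.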
On the receiving end, the size of the character content can be established directly from an MPI_Probe prior to setting up the MPI_Recv/MPI_Irecv. This avoids both the memory overhead of PstreamBuffers (the previous implementation) and the need to coordinate between all ranks (PstreamBuffers has a synchronization point when establishing the buffer sizes as part of the PEX algorithm).
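The probe-then-receive pattern can be sketched with plain MPI calls as follows. recvChars is a hypothetical helper written for illustration, not a function from the code base, and it uses a blocking MPI_Recv for brevity where the actual changes may use MPI_Irecv.

```cpp
#include <mpi.h>
#include <vector>

// Hypothetical helper (not an OpenFOAM function): receive a character
// message whose length is not known in advance.
std::vector<char> recvChars(int fromRank, int tag, MPI_Comm comm)
{
    MPI_Status status;

    // Block until a matching message is pending. Nothing is received yet.
    MPI_Probe(fromRank, tag, comm, &status);

    // Query the message length in units of MPI_CHAR
    int count = 0;
    MPI_Get_count(&status, MPI_CHAR, &count);

    // Size the receive buffer exactly, then receive the probed message
    std::vector<char> buf(count);
    MPI_Recv
    (
        buf.data(), count, MPI_CHAR,
        status.MPI_SOURCE, status.MPI_TAG,
        comm, MPI_STATUS_IGNORE
    );

    return buf;
}
```

Only the sender and the receiver are involved, so no size exchange across all ranks is needed.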
With master-only (non-collated) writing, polling dispatch is now used to write file content as it becomes available.
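A rough sketch of polling dispatch, again with the standard library only. Pending, pollsUntilReady and pollAndWrite are hypothetical names. The countdown in ready() is an assumed stand-in for testing a non-blocking receive for completion, so this illustrates the control flow rather than the actual implementation.

```cpp
#include <string>
#include <vector>

// Hypothetical pending transfer. ready() stands in for a completion test
// on a non-blocking receive and becomes true after a number of polls.
struct Pending
{
    int rank;
    int pollsUntilReady;
    std::string data;

    bool ready() { return pollsUntilReady-- <= 0; }
};

// Poll all outstanding transfers and write each one as soon as it has
// completed, instead of waiting for the ranks in a fixed order.
template<class Writer>
void pollAndWrite(std::vector<Pending>& pending, Writer write)
{
    while (!pending.empty())
    {
        for (auto it = pending.begin(); it != pending.end(); /*nil*/)
        {
            if (it->ready())
            {
                write(*it);
                it = pending.erase(it);
            }
            else
            {
                ++it;
            }
        }
    }
}
```

With this scheme a slow rank does not hold up the writing of content that has already arrived from other ranks.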