At the moment, each application of the improved staggered Dslash operator involves the communication of a quark-field boundary layer 3 lattice-sites wide between GPUs operating on neighboring lattice domains. To date, linear solves using half-precision outside of the preconditioner have proven to be unstable, while solvers that use mixed single-double precision work. However, perhaps we could use half precision instead of single precision in the Dslash communication routines. We could even use a mixed-precision communication routine, where 1/3 of the boundary-layer data is communicated in single-precision and the remainder - the data needed for the three-hop term - is communicated in half-precision. This is motivated by the fact that the three-hop coefficient is numerically very small and omitting this term in the preconditioner has a negligible effect on solver convergence. Similarly, we might consider using reconstruct 9 for the long links in the single-precision exterior Dslash kernels.