Only a limited number of messages can execute within one round, so a call to another canister on the same subnet can still unpredictably require a consensus and networking round between the two canisters' message executions. This effect only gets bigger as the subnet's load grows and canisters compete for round space. The stats you mentioned can only be achieved predictably on a rented subnet that executes a limited number of messages at a time. On the public subnets, where anyone can create canisters, it's not possible to get those numbers predictably.
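To make the round-space point concrete, here is a toy model in Python. It is only a sketch under simplifying assumptions, not how the real scheduler works: each round executes at most `round_capacity` messages in FIFO order, `backlog` messages from other canisters are already queued ahead of ours, and a same-subnet call needs two executions (the callee's handler, then the caller's callback). The function name and parameters are made up for illustration.

```python
from collections import deque


def rounds_until_reply(round_capacity: int, backlog: int) -> int:
    """Return the round in which the caller's callback runs (toy model).

    Assumptions (hypothetical, not the real scheduler):
    - every round executes at most `round_capacity` messages, FIFO;
    - `backlog` unrelated messages are queued ahead of our request;
    - a message produced during a round can still run in that round
      if capacity is left, otherwise it waits for the next round.
    """
    if round_capacity <= 0:
        raise ValueError("round_capacity must be positive")

    queue = deque(["other"] * backlog + ["request"])
    current_round = 0
    while True:
        current_round += 1
        executed = 0
        while executed < round_capacity and queue:
            message = queue.popleft()
            executed += 1
            if message == "request":
                # The callee ran; its response goes back to the caller.
                queue.append("response")
            elif message == "response":
                # The caller's callback ran: the call is complete.
                return current_round


if __name__ == "__main__":
    for backlog in (0, 9, 25):
        print(backlog, rounds_until_reply(round_capacity=10, backlog=backlog))
```

With an empty subnet the whole call fits in one round, but as soon as the round fills up right after the request executes, the response is pushed into a later round, even though both canisters are on the same subnet.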
Another factor here is the end-to-end latency of an ingress message as seen by the end user. Measured that way, the difference between calling a canister that calls another canister on the same subnet and calling a canister that calls another canister on a different subnet is much smaller (maybe around 3x at most?).
For the plan in the original post, we can say that every canister call takes up to the time of a cross-subnet call. Sometimes it can be faster, if the protocol is able to make it faster (if both canisters happen to be on the same subnet at the time of the call), but canister authors should keep in mind that a canister call can always take up to the time of a cross-subnet call. The cycles cost of a canister call can be made the same whether or not the canisters are on the same subnet; the cost can be set by looking at the average resource usage of canister calls throughout the whole network.
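As a rough sketch of how such a flat price could be derived, the Python snippet below takes a call-count-weighted average over call categories. The function name, the input shape, and the numbers (in arbitrary "resource units", not real cycle prices) are all hypothetical; the actual pricing would be up to the protocol.

```python
def flat_call_fee(samples: list[tuple[int, float]]) -> float:
    """Return one network-wide fee per call (hypothetical sketch).

    `samples` holds (number_of_calls, average_cost_per_call) pairs,
    one pair per category, e.g. same-subnet and cross-subnet calls.
    The fee is the average cost weighted by how many calls fall
    into each category.
    """
    total_calls = sum(count for count, _ in samples)
    if total_calls == 0:
        raise ValueError("need at least one observed call")
    total_cost = sum(count * cost for count, cost in samples)
    return total_cost / total_calls


if __name__ == "__main__":
    # Made-up numbers: 900 same-subnet calls at 100 units each,
    # 100 cross-subnet calls at 1000 units each.
    print(flat_call_fee([(900, 100.0), (100, 1000.0)]))
```

Every caller then pays the same fee regardless of where the callee currently lives, so the price does not change if a canister moves between subnets.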