I posted a question on Stack Overflow about a sudden reduction in application throughput. I also followed up on it with AWS Support, and the following is the gist of that conversation:
tl;dr:
The AWS instances my team was using were not set up with the enhanced networking capabilities needed to get maximum network performance (i.e. 10 gigabits on a c4.8xlarge instance). For example, the ixgbevf driver on test-aws-am was not at version 2.14.2.
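The driver check mentioned above can be sketched roughly as follows. This is only a sketch: it assumes the driver details come from the "key: value" text that `ethtool -i <interface>` prints on the instance, and it treats 2.14.2 as a minimum version, which is my assumption. The helper names are hypothetical.

```python
def parse_ethtool_info(text):
    """Parse `ethtool -i <interface>` output ("key: value" lines) into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as PCI addresses survive.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def version_tuple(version):
    """Turn a dotted version such as "2.14.2" into (2, 14, 2) for comparison.

    Non-numeric parts (for example a vendor suffix) are dropped.
    """
    return tuple(int(part) for part in version.split(".") if part.isdigit())


def driver_meets(info, driver="ixgbevf", minimum="2.14.2"):
    """True if the reported driver matches and is at least the minimum version.

    Treating 2.14.2 as a minimum is an assumption, not something AWS Support stated.
    """
    return (info.get("driver") == driver
            and version_tuple(info.get("version", "0")) >= version_tuple(minimum))
```

To use it, capture the output of `ethtool -i eth0` on the instance (the interface name is an assumption) and pass the text to `parse_ethtool_info`, then hand the resulting dict to `driver_meets`.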
Long Version:
The test in question fetches a static HTML page: a plain GET request with no complicated logic, no EBS, no DJ-CTS, etc.
- What network capabilities should we expect when an instance does not satisfy the enhanced networking requirements?
  AWS Support: We really do not have specific numbers because that varies by a lot of factors (time of day, other instances sharing the network in the same location, and several other factors). What we do is advise customers to do benchmarking tests to confirm that the instances meet the performance expectations for the applications.
- During the test, 18430 kilobytes/sec of data was transferred (roughly 0.15 gigabits/sec), which is far below the 10 gigabit network limit.
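To make the gap between the observed transfer rate and the 10 gigabit limit concrete, here is a small conversion sketch. It assumes "KiloBytes" means 1024 bytes; the function name and the `bytes_per_kilobyte` parameter are mine, and passing 1000 instead barely changes the result.

```python
def to_gigabits_per_sec(kilobytes_per_sec, bytes_per_kilobyte=1024):
    """Convert a kilobytes/sec transfer rate to gigabits/sec (1 Gbit = 1e9 bits)."""
    return kilobytes_per_sec * bytes_per_kilobyte * 8 / 1e9


observed = to_gigabits_per_sec(18430)  # rate reported during the test
link_limit = 10.0                      # nominal c4.8xlarge limit, in Gbit/s
print(f"{observed:.3f} Gbit/s, {observed / link_limit:.1%} of the link")
# prints "0.151 Gbit/s, 1.5% of the link"
```

Under either definition of a kilobyte, the test was using under 2% of the nominal link capacity, so raw bandwidth alone does not explain the drop.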
AWS Support has been insistent that the throughput throttling is probably an application issue.
AWS Support: Through our testing we have eliminated ELB as a potential bottleneck for the drop in throughput and we know that the issue is occurring at the back-end instances. Looking at the back-end instances, the general performance metrics such as CPU utilization, Network In, Network Out, etc. for both instances look good. This indicates that the issue we are facing could be an application issue. Could you please add another instance and try the test again?
I cannot confirm this, since there were zero errors across multiple iterations of the test runs. Once a test hits the lower throughput limit, any subsequent test run stays in the lower range of throughput (about 8000 requests/sec). If I wait a few hours (about 2 to 3) and run the test again, the same pattern repeats: higher throughput for about half an hour, then back to 8000 requests/sec. I ruled out adding another instance because testing without ELB (described below) exhibits the same behavior.
- ELB pre-warming did not help with the reduced throughput. Moreover, pre-warming is time-bound and the ELB scales back down after some time. If anything, the pre-warmed run showed more skewed results: throughput dropped within 5 minutes, sooner than without it.
- Changing the test agent's instance type from m4.4xlarge to m4.10xlarge and c4.large showed the same throttling of application throughput. Test runs with a ramp-up period of increasing threads show the same behavior, except that it takes longer to reach the higher limit, which is attributable to the ramp-up period and not to the half-hour window.
- Testing without ELB exhibits the same behavior, except that the high throughput is in the range of 8000 requests/sec and then drops to about 4000 requests/sec. Although the tests already use a custom DNS resolver, this further rules out any anomaly caused by possible DNS caching of ELB addresses.
The mystery of the drop in application throughput was never solved.
Later I learned from our in-house network engineers that 10 gigabit network performance requires not only enhanced networking but also a placement group. Unfortunately, a placement group is limited to instances in the same availability zone. Isn't that risky?