Tuesday, November 3, 2015

JMeter and Load Testing Best Practices

As the saying goes, there are no best practices, only good practices in context. The following are guidelines I have followed when load testing with JMeter. Some of them are not specific to JMeter and apply to load testing in general. So let’s begin -

  • Number of Test Runs - The worst thing you can do with a load test is to run it only once. Since your test environment depends on many factors, it is wise to run the load test more than once to verify that the results are consistent. If results differ by more than 5~10% between runs, that is a sign of an inconsistent system. Figure out the cause of the problem and fix it first.
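
The consistency check can be sketched in shell. This is a minimal sketch, assuming you already have the average response time (in ms) of two runs; the helper name pct_diff and the sample numbers are made up for illustration.

```shell
# pct_diff BASE OTHER: absolute difference of OTHER relative to BASE, in percent.
# Hypothetical helper; BASE and OTHER are e.g. average response times in ms.
pct_diff() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    d = (b - a) / a * 100
    if (d < 0) d = -d
    printf "%.1f\n", d
  }'
}

# Example: run 1 averaged 200 ms, run 2 averaged 230 ms.
pct_diff 200 230
```

Here the two runs differ by 15%, which is outside the 5~10% band, so the environment deserves a closer look before the numbers are trusted.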

  • State of system - If you run load tests back to back, be aware that each run leaves the system in a state where memory, database and other resources are already in use. Ideally, each test should start from a clean system state.

  • Identify client limitations - Often the client itself becomes the bottleneck when a high number of threads run on one machine; moreover, the quicker the application responds, the more work JMeter has to do to process results. Various sources recommend not using more than 300 threads on one machine.
While load testing one API, I found that more than 70 threads on one JMeter client resulted in very high 95th and 99th percentile response times, but when I distributed the load across multiple agents with 70 threads each, response times were within acceptable limits.
Hence, once you start seeing higher response times, unexpected throughput etc., also consider whether you are hitting a limit on the client side. One way to find out is to split one large load on a single test agent into smaller loads on multiple test agents and see how the results change. Once you know the maximum number of threads one test machine can sustain, you know how many machines you need to generate the required load.
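
The sizing step at the end can be written as a one-line calculation. This is only a sketch: the helper name agents_needed is hypothetical, and the target of 500 threads is an example value; 70 is the per-client limit from the experience described above.

```shell
# agents_needed TOTAL_THREADS MAX_PER_AGENT: number of load generators, rounded up.
agents_needed() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# Example: 500 threads in total, at most 70 threads per JMeter client.
agents_needed 500 70
```

With these example numbers the result is 8 machines, since 7 machines would only cover 490 threads.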

  • ulimit - Each user has a limit on the number of open files. This limit applies to each process run by the user. If the limit is 1024 and the user has three processes running, each process can open up to 1024 files, so the three together can open 3072 files.
To find out the soft limit -
    ulimit -Sn
    1024

To find out the hard limit -
    ulimit -Hn
    2048
ulimit -n shows the soft limit. The soft limit is the limit actually enforced when opening files. The hard limit is the ceiling up to which you can raise the soft limit.

Increasing the soft limit to 1080 -
    ulimit -Sn 1080

Changing the hard limit -
    ulimit -Hn 4000

ulimit -n 4000 sets both the soft limit and the hard limit to the same value.
Once you have lowered the hard limit, a non-root user cannot raise it above that value again in the same session.

If you set the soft limit above the hard limit then you get an error -
    ulimit -Sn 5000
    bash: ulimit: open files: cannot modify limit: Invalid argument

Changes made with ulimit apply only to the current shell session; once you reboot, the limits are reset.

To make the limit bigger and make the change permanent, edit the following config file on Ubuntu and reboot -

    sudo nano /etc/security/limits.conf

Add lines like these -
    <username> soft nofile 4000
    <username> hard nofile 5000

You can use * in the limits.conf file instead of a user name to specify all users. This does not apply to root -
    * soft nofile 20000
    * hard nofile 30000

This is how it looks on my system -
    # End of file
    * soft nofile 32768
    * hard nofile 32768
    root soft nofile 32768
    root hard nofile 32768

Don’t forget to reboot ;-)

Finding the number of open files on Ubuntu -

Grep for the process -
    ps aux | grep jmeter

Let’s assume the process id is 12345.
Now you can list its open files using the lsof command -
    lsof -p 12345

and count them by counting the number of lines output by lsof (the output includes a header line, so treat this as an approximate count) -
    lsof -p 12345 | wc -l
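
To see how close a process is to its limit, the two numbers can be printed side by side. This is a rough sketch that assumes a Linux /proc filesystem; the helper name fd_headroom is hypothetical, and the current shell's pid ($$) stands in for the JMeter pid.

```shell
# fd_headroom PID: print the open descriptor count and the soft "open files" limit.
# Assumes Linux: /proc/PID/fd has one entry per open descriptor, and
# /proc/PID/limits lists the soft limit in the 4th column of "Max open files".
fd_headroom() {
  pid="$1"
  open=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
  limit=$(awk '/Max open files/ { print $4 }' "/proc/$pid/limits" 2>/dev/null)
  echo "pid=$pid open=$open soft_limit=$limit"
}

# Example: inspect the current shell; replace $$ with the JMeter process id.
fd_headroom $$
```

If the open count keeps approaching the soft limit during a test, raise the limit before the next run.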


  • Assertions - As in manual testing, it is important to verify that a web page or API response is correct under load. A plain 200 response code does not guarantee that the page or API response is what it is supposed to be. You can check this by adding assertions to the sampler. Example JMeter assertions - Response Assertion, Duration Assertion etc.
The result of an assertion can be seen in the Assertion Results listener.

Even if you don’t add the Assertion Results listener, a failed assertion is always reported in the test results.
Response Assertion and Duration Assertion are more or less safe to use, whereas Compare Assertion and XML-based ones like XPath Assertion take up the most CPU and memory.

  • Wait Period (aka sleep time) - If you have worked with UI test automation tools then you know how much static waits are hated. In load testing, however, a wait period between transactions is recommended. It is the pause between subsequent sample requests. A wait period is required to emulate real user behavior, since a real user does not fire one request after another but pauses for “some” duration before continuing with the next request. You can use the Constant Timer in JMeter to achieve this. Like other JMeter elements, a timer can be added at the test plan level (where it applies to all HTTP requests) or to specific samplers to give each sampler a different delay. JMeter also offers the Uniform Random Timer, which generates a random pause.

  • Retrieving embedded resources - A web page is made of many components: CSS files, images, JS files etc. You can instruct JMeter to download these resources during the load test. HTTP Sampler > “Retrieve All Embedded Resources” - check this checkbox to make JMeter download JavaScript, CSS and images just as a real browser would. Also set “Use thread/connection pool” to simulate the browser's parallel fetching (use between 2-4 threads). For every thread simulating a user, JMeter creates a separate thread pool of the given size, with thread names like pool-n-thread-m. The main page is downloaded by the user’s thread "Thread Group 1-k" while the embedded resources are downloaded by its associated thread pool. When setting the concurrent pool size, keep in mind the number of users being simulated, because a separate thread pool is created for each simulated user. With many users, too many threads may get created and start affecting response times adversely due to bandwidth contention on the JMeter side. If many users are to be simulated, it is recommended to distribute the JMeter test across multiple machines.

Screenshot from 2015-10-26 16:31:09.png

   
When retrieving embedded resources, make sure you exclude external domains from the download using the “Embedded URLs must match” field. You don’t want to load test external URLs you have no control over.
For example, add the following RegEx to the edit box named “Embedded URLs must match” to exclude external domains -
^((?!<domain #1>|<domain #2>|<domain #3>|<domain #4>|<domain #5>).)*$
E.g.
^((?!google|facebook|pinterest|twimg|doubleclick).)*$
Set “Retrieve All Embedded Resources” in “HTTP Request Defaults” if resources are to be retrieved for all the HTTP samplers.
To download all resources from the web site https://www-de.test.appdoamin.net/ use the following URL pattern - .*test\.appdoamin\.net.*
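
One way to sanity-check the exclusion pattern before pasting it into JMeter is to run a few URLs through grep. This sketch assumes GNU grep with -P (Perl-compatible regex, needed for the negative lookahead); the helper name filter_embedded and the sample URLs are illustrative only.

```shell
# filter_embedded: read URLs on stdin, print only those the exclusion pattern lets through.
# Requires grep -P (GNU grep built with PCRE support).
filter_embedded() {
  grep -P '^((?!google|facebook|pinterest|twimg|doubleclick).)*$'
}

# Example: the first URL should be kept, the second dropped.
printf '%s\n' \
  'https://www-de.test.appdoamin.net/css/main.css' \
  'https://www.google.com/analytics.js' | filter_embedded
```

Only URLs that contain none of the listed words survive, so third-party resources are skipped during the test.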

  • Listeners - Listeners are the way to monitor test results in JMeter, either during test execution or once it is over. Listeners receive sample results and process them, which takes resources (memory, CPU). Hence do not use any listener during load testing. Run your test in non-GUI mode and save the test results for later analysis. GUI mode with listeners should only be used for debugging and for making sure your test script works before going for a full-blown test.
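
A non-GUI run might be launched as sketched here. The file names testplan.jmx and results.jtl are placeholders, and the hypothetical helper jmeter_cli only prints the command so that it can be tried without JMeter installed. -n, -t and -l are JMeter's options for non-GUI mode, the test plan file and the results file.

```shell
# jmeter_cli PLAN RESULTS: print the non-GUI command line (dry run, nothing is executed).
# -n = non-GUI mode, -t = test plan file, -l = file to write sample results to.
jmeter_cli() {
  echo "jmeter -n -t $1 -l $2"
}

# Example with placeholder file names.
jmeter_cli testplan.jmx results.jtl
```

Run the printed command from JMeter's bin directory (or with jmeter on your PATH) and analyse results.jtl after the test.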

  • Out of memory Exception - If you encounter an out of memory exception during test execution then you need to make more memory available to JMeter. The JMeter start script (jmeter.bat on Windows, jmeter.sh on Linux) launches JMeter with a default heap of 512 MB, which corresponds to -

JVM_ARGS="-Xms512m -Xmx512m" jmeter.sh

You can try increasing the maximum heap until you stop getting out of memory errors. You could set it to around 70% of the machine's RAM.

JVM_ARGS="-Xms2G -Xmx4G" jmeter.sh

The -Xmx flag specifies the maximum memory allocation pool for a Java Virtual Machine (JVM), while -Xms specifies the initial memory allocation pool.
This means that your JVM is started with Xms amount of memory and can use at most Xmx amount of memory. For example, starting a JVM as below starts it with 256 MB of heap and allows the heap to grow up to 2048 MB:
java -Xmx2048m -Xms256m
The memory size can be given in different units, such as kilobytes, megabytes and gigabytes:
-Xmx1024k
-Xmx512m
-Xmx8g
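
The 70% rule of thumb can be turned into a heap setting with a small calculation. This is a sketch only: the helper name heap_mb is hypothetical, the 8192 MB figure is an example, and setting -Xms equal to -Xmx is a choice made here for simplicity.

```shell
# heap_mb TOTAL_RAM_MB: 70% of the given RAM, in whole megabytes.
heap_mb() {
  echo $(( $1 * 70 / 100 ))
}

# Example for a load generator with 8 GB (8192 MB) of RAM.
max=$(heap_mb 8192)
echo "JVM_ARGS=\"-Xms${max}m -Xmx${max}m\""
```

The printed value can then be placed in front of the jmeter.sh command in the same way as the JVM_ARGS examples above.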

  • Viewing JMeter results summary in non GUI mode - When running a test in non-GUI mode, you will want to see a snapshot of the test run, such as the number of threads and response times. To do this, uncomment the following lines in the JMeter properties file (remove the ‘#’):

#summariser.name=summary
#summariser.interval=180
#summariser.out=true

    These parameters are already enabled in the latest version of JMeter :-)
    The summary report looks like this -

Screenshot from 2015-11-03 16:26:49.png


  • Distributed (Remote) Testing -
Once you reach the limits of one machine, you can switch to distributed (remote) testing. However, the JMeter defaults are not tuned for efficient remote testing, so add the following in user.properties:
mode=StrippedBatch
This will:
  • strip some data, such as the response body, from the sample results, since you don’t need heavy response bodies when running a heavy load test
  • send sample results in batches rather than one per sample, reducing CPU, IO and network round trips

  • Use CSV as output for Save Service - XML is verbose and takes CPU and memory to write and analyse; CSV is a good choice as it uses the least resources. Hence save your test output in a CSV file. When running a test from the command line, you can add the following parameter to write the JTL file in CSV format -


-Jjmeter.save.saveservice.output_format=csv

Furthermore, for massive load tests there is a lot of result data you don't need. So, in user.properties, add:

jmeter.save.saveservice.output_format=csv
jmeter.save.saveservice.data_type=false
jmeter.save.saveservice.label=true
jmeter.save.saveservice.response_code=true
jmeter.save.saveservice.response_data.on_error=false
jmeter.save.saveservice.response_message=false
jmeter.save.saveservice.successful=true
jmeter.save.saveservice.thread_name=true
jmeter.save.saveservice.time=true
jmeter.save.saveservice.subresults=false
jmeter.save.saveservice.assertions=false
jmeter.save.saveservice.latency=true
jmeter.save.saveservice.bytes=true
jmeter.save.saveservice.hostname=true
jmeter.save.saveservice.thread_counts=true
jmeter.save.saveservice.sample_count=true
jmeter.save.saveservice.assertion_results_failure_message=false
jmeter.save.saveservice.timestamp_format=HH:mm:ss
jmeter.save.saveservice.default_delimiter=;
jmeter.save.saveservice.print_field_names=true
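
A CSV result file written with these settings is easy to summarise from the command line. This is a minimal sketch, assuming a ';' delimiter and a header row containing the elapsed and success columns; the helper name jtl_summary and the three-line sample file are made up for illustration.

```shell
# jtl_summary FILE: sample count, average elapsed time (ms) and error count
# from a ';'-delimited JTL file that starts with a header row.
jtl_summary() {
  awk -F';' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }  # map header names to column numbers
    {
      total += $(col["elapsed"]); n++
      if ($(col["success"]) != "true") errors++
    }
    END { if (n) printf "samples=%d avg_ms=%.1f errors=%d\n", n, total / n, errors }
  ' "$1"
}

# Tiny made-up sample with a subset of the columns.
sample=$(mktemp)
printf '%s\n' \
  'timeStamp;elapsed;label;responseCode;success' \
  '10:00:01;120;Home;200;true' \
  '10:00:02;180;Home;500;false' > "$sample"
jtl_summary "$sample"
rm -f "$sample"
```

Looking columns up by header name keeps the script working when the set of saved fields changes.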


You can also write the variables / parameters used during the test to the CSV file. For example, if your test plan uses the variables counter and accessToken, you can add them to the CSV file with the following command-line option -

-Jsample_variables=counter,accessToken



  • Using Regular Expression Extractor - Use the Regular Expression Extractor for extracting data BUT never ever check Body (unescaped). Choose among:
Body
Headers
URL
Response Code
Response Message

Use efficient regular expressions and extract as little data as possible.

  • Thread group name for distributed testing - For distributed testing, name the thread group as -
${__machineName()}_My Threadgroup name
This makes the thread group name unique to each machine.

  • Use the HTTP Cache Manager to simulate the browser cache

  • Use the HTTP Cookie Manager to simulate browser cookies

  • By default, JMeter does not save thread counts in JTL files. If you plan to work with JTL files, enable it by uncommenting the following line in JMETER-INSTALL-DIR/bin/jmeter.properties and setting it to true:
#jmeter.save.saveservice.thread_counts=true

  • To simulate a browser, add a User-Agent string in the HTTP Header Manager. You can copy it from your browser -

It does not matter where you place the Header Manager; the headers in the request will be the same -

  • Sharing variables between threads and thread groups -
Variables are local to a thread; a variable set in one thread cannot be read in another. This is by design. For variables that can be determined before a test starts, see Parameterising Tests (above). If the value is not known until the test starts, there are various options:
  • Store the variable as a property (for example, set it with the __setProperty function and read it with __P) - properties are global to the JMeter instance, hence they can be used in different thread groups, unlike variables, which are local to a thread.

  • When testing a web application, log in manually as one user and navigate to the screen under test; you may find application bugs.


  • Application logs -

During manual testing, analysing application logs helps uncover application defects which would otherwise be missed. The same is true for load tests. You may find connection timeouts, application operation failures etc. which add more value to the load test report. So don’t forget to scan application logs when carrying out a load test :-)

  • Check for stickiness on AWS load balancer -

Screenshot from 2016-01-05 15:11:26.png

   
        Stickiness should be disabled, else your traffic would end up on one instance


  • Not really a best practice, but remember that you cannot add a port number in the Server Name / IP field of a sampler. This is easy to forget if you use a variable for the URL and specify the port number there. You must specify the port number only in the Port Number field. You can also specify the port number in the HTTP Request Defaults element.

  • Which HTTP Request Implementation to use?

        Screenshot from 2016-01-06 16:24:36.png

There are some limitations to using the Java implementation, as described here
Hence use the HttpClient4 implementation, which employs Apache HttpComponents HttpClient 4.x

  • Besides the snazzy iftop / top etc. commands (a topic for another post), you can also read network read, write and other operations from the AWS console
       
Screenshot from 2016-01-07 11:27:33.png

Notice that if the EC2 instance has an EBS volume mounted then you would see read and write operations on the corresponding mounted EBS volume -

Screenshot from 2016-01-07 13:12:54.png

In a similar manner you can also monitor 4XX and 5XX errors from load balancer monitoring. This comes in handy when you encounter 504 or other errors on the client end but don’t see any error in the application log. Then you know the load balancer is generating the errors.

Screenshot from 2016-01-13 14:52:07.png

  • Are you filling the Surge Queue Length? According to AWS -

Surge Queue Length > The total number of requests that are pending routing. The load balancer queues a request if it is unable to establish a connection with a healthy instance in order to route the request. The maximum size of the queue is 1,024. Additional requests are rejected when the queue is full. For more information, see SpilloverCount.
Reporting criteria: There is a nonzero value.
Statistics: The most useful statistic is max, because it represents the peak of queued requests. The average statistic can be useful in combination with min and max to determine the range of queued requests. Note that sum is not useful.
Example: Suppose that your load balancer has us-west-2a and us-west-2b enabled, and that instances in us-west-2a are experiencing high latency and are slow to respond to requests. As a result, the surge queue for the load balancer nodes in us-west-2a fills, with clients likely experiencing increased response times. If this continues, the load balancer will likely have spillovers (see the SpilloverCount metric). If us-west-2b continues to respond normally, the max for the load balancer will be the same as the max for us-west-2a.

You can observe the surge queue length in CloudWatch metrics.

You can filter by per-LB metrics and by per-LB, per-AZ metrics -
Screenshot from 2016-01-28 12:12:45.png

Screenshot from 2016-01-28 12:13:20.png

and this is how the graph looks -

Screenshot from 2016-01-28 12:09:19.png


  • If you encounter high average latency on the ELB like this -
        Screenshot from 2016-02-08 13:03:16.png

    then it is time to troubleshoot the ELB latency issue




  • Why does throughput on an AWS instance drop suddenly?
You may have seen such behavior when load testing instances deployed on AWS. Due to I/O credits and burst performance, AWS EBS delivers higher throughput for a certain period and then drops to baseline performance. If you see a drastic drop in application throughput while other performance metrics show no degradation, it is time to find out what IOPS you need and what your AWS infrastructure supports. Here is a case study from a load test I conducted for one project -

Test Environment -

    • 2 c4.8xlarge instances
    • 2 EBS volumes, 120 GB each
    • 10 JMeter m4.4xlarge test clients (I repeated the tests with m4.10xlarge test instances, which are 10 Gigabit network instances, but the results were the same)

Considering the EBS-EC2 doc, I was on a 10 Gigabit network instance since I was using c4.8xlarge instances. Hence I assume that the max bandwidth is limited to 500 MB/s.
And considering EBSVolumeTypes, I was limited to a maximum throughput of 160 MiB/s.

Throughput during test -
  • Read Request - 12000/sec
  • Write Request - 300/sec

Taking Avg Read and Write size into consideration -

  • Read Bandwidth - 12000 * 30 (avg read size of EBS from AWS console in KiB/op) = 360000 KB/s, about 360 MB/s
  • Write Bandwidth - 300 * 60 (avg write size of EBS from AWS console in KiB/op) = 18000 KB/s, about 18 MB/s

Which is about 378 MB/s in total.
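
The same arithmetic can be scripted so that it is easy to repeat with other numbers from the console. This is a sketch: the helper name ebs_mib_per_s is hypothetical, and it divides by 1024 to report MiB/s, which is why the result comes out slightly below the rounded 378 MB figure.

```shell
# ebs_mib_per_s READ_OPS READ_KIB WRITE_OPS WRITE_KIB: combined bandwidth in MiB/s.
# Ops are per second; sizes are the average KiB per operation.
ebs_mib_per_s() {
  awk -v ro="$1" -v rk="$2" -v wo="$3" -v wk="$4" \
    'BEGIN { printf "%.1f\n", (ro * rk + wo * wk) / 1024 }'
}

# Numbers from the test above: 12000 reads/s at 30 KiB, 300 writes/s at 60 KiB.
ebs_mib_per_s 12000 30 300 60
```

This prints 369.1, which is still more than twice the 160 MiB/s volume limit mentioned earlier.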

I suppose that I was able to reach a throughput of more than 160 MB/s owing to I/O credits and burst performance.

But the credit balance runs out at some point during the test and performance comes down to baseline. This is when throughput drops from 12000 req/sec to 6000 req/sec. I repeated the test on different days and at different times but the results were the same. Throughput drops dramatically (in fact to as low as 4000/sec) and stays there for about 30 minutes of the test run.

TransactionsPerSecond.png

Other performance metrics, i.e. CPU and load average, were considerably low for the entire duration of the load test.

Given this, I don't foresee any reason other than the network limitation for the drop in application throughput.

You could also scrutinize the network limitation by analyzing sent and received bytes from the ELB access log. You can enable the access log by following this document -
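
Once access logs are available, the sent and received bytes can be totalled with awk. This sketch assumes the Classic Load Balancer access log layout, where received_bytes is the 10th and sent_bytes the 11th space-separated field; the helper name elb_bytes and the two sample entries are made up for illustration.

```shell
# elb_bytes FILE: total received_bytes and sent_bytes in an ELB access log.
# Assumption: Classic ELB format, received_bytes = field 10, sent_bytes = field 11.
elb_bytes() {
  awk '{ rx += $10; tx += $11 } END { printf "received=%d sent=%d\n", rx, tx }' "$1"
}

# Two made-up log entries in that layout.
log=$(mktemp)
cat > "$log" <<'EOF'
2015-05-13T23:39:43.945958Z my-loadbalancer 192.168.131.39:2817 10.0.0.1:80 0.000073 0.001048 0.000057 200 200 0 29 "GET http://www.example.com:80/ HTTP/1.1" "curl/7.38.0" - -
2015-05-13T23:39:44.145958Z my-loadbalancer 192.168.131.39:2818 10.0.0.1:80 0.000086 0.001160 0.000073 200 200 120 512 "POST http://www.example.com:80/api HTTP/1.1" "curl/7.38.0" - -
EOF
elb_bytes "$log"
rm -f "$log"
```

Dividing the totals by the length of the log window gives an average bandwidth to compare against the instance and volume limits.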

  • Do you have connection pools large enough to support the number of threads with which you run the test?


  • If you have tons of components, did you test them individually to isolate slow/erroneous components?

