

IdeaForest.com Stress Test Report

Revision 1
August 27, 2000


David DeBry (grue) -- grue@thrownclear.com
Thrown Clear Productions


----------------------------------------


Table of Contents

Deluge Overview
    The Configuration File
    Basic Attack Flow
    Playback Attacks
    Exceptions to Basic Flow
    Tuning

Metric Issues

Attack and Result Data
    User Types
    Attacks
        Registration Stress
        Street Fair Stress
        Cross Section
        No Forms

Suggestions for Future Testing
    Deluge Issues
    Hardware Issues
    Attack Issues


----------------------------------------


Deluge Overview


    The data that any system spews out is generally useless unless you
understand how it was generated.  As such, here is an explanation of what
the stress test software ("Deluge") does, and how it works.

    There are three parts to Deluge:

    1] dlg_proxy, the recording proxy server.  This is a proxy web server
that intercepts all requests made from a standard web browser before
sending the requests on to the web site.  The proxy doesn't change a
request before sending it on; it only records it into a file that the
attack program can use.

    2] dlg_attack, the attack program.  This is the part of the system
that does most of the work: simulating users, sending requests to the web
server, timing and examining responses, and writing the results to a log
file.

    3] dlg_eval, the log evaluator.  A post process to the actual attack,
the log evaluator reads the data that the attack program dumps out, and
returns statistics describing how well the test site responded to the
attack.


The Configuration File

    The three programs are connected by means of a configuration file.
The configuration file is written by the person running Deluge, and is
used to control aspects of all three programs.  The same configuration
file is used to run all three programs (in most cases).

    dlg_attack simulates many instances of multiple user types, as
defined in the configuration file.  For instance, you could create a user
type that started at the home page of a site, wandered around randomly
for five pages, and then left.  You also might want one hundred of these
users running.  Your configuration file might look like this:


timeout = 30                     Time in seconds to wait for the server
log_filename = wander_100.log    File to put all the log info into
threads_per_proc = 10            Max simultaneous requests in queue
queue_max_delay = 0.75           Max time in seconds before queue is run
attack_time_length = 1800        Time in seconds the entire test should run
user_ramp_time = 180             Time in seconds to ramp in the users

user_def = wander_robot                   Name of user type
    attack_type = wander                  Type of attack (this is a robot)
    top_url = http://www.ideaforest.com/  First page all users start on
    limit_pages_traversed = 5             Pages to read before complete
    instances = 100                       How many of this user to make
    delay_time = 120                      Time between pages in seconds
    delay_spread = 10                     Deviation +/- seconds from
                                              delay_time
    restartable = 1                       Users restart after completion
    restart_time = 120                    Time before restarting in seconds
    restart_spread = 10                   Deviation +/- seconds from
                                              restart_time
    get_images = 1                        Users are allowed to request images
    accept_cookies = 1                    Users are allowed to accept cookies
END


    The indented user_def section is where the user type is defined.
There are some extra controls set there: how long a user might take to
read a page, whether a user restarts when its tasks are completed
(instead of just dying), whether it may request images and accept
cookies, and so on.  There are many other possible parameters to control
a user, and these are listed in the documentation for Deluge.

    The area ahead of the user_def section holds the global controls.
Each of these is explained as dlg_attack's functionality is walked
through.
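
    As a rough illustration of how a file laid out like this could be
read, here is a minimal Python sketch.  It is not Deluge's own parser.
It assumes that the descriptive text on the right of each line in the
example is annotation for this report rather than part of the file, that
every setting is written as "name = value", and that END closes a
user_def.  The function name parse_config is made up for the example.

```python
def parse_config(text):
    """Split a Deluge-style config into global settings and user types.

    Assumption: each non-blank line is 'name = value', a 'user_def'
    line opens a user type, and a line reading 'END' closes it.
    """
    global_settings = {}
    user_types = {}
    current = None  # settings of the user_def being read, if any

    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line == "END":
            current = None
            continue
        # Split on the first '=' only, so URLs in values stay intact.
        name, _, value = line.partition("=")
        name, value = name.strip(), value.strip()
        if name == "user_def":
            current = user_types.setdefault(value, {})
        elif current is not None:
            current[name] = value
        else:
            global_settings[name] = value
    return global_settings, user_types


example = """
timeout = 30
user_ramp_time = 180

user_def = wander_robot
    attack_type = wander
    top_url = http://www.ideaforest.com/
    instances = 100
    delay_time = 120
END
"""

settings, users = parse_config(example)
print(settings["user_ramp_time"], users["wander_robot"]["instances"])
```

    All values are kept as strings here; a real reader would convert
numbers such as delay_time before using them.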


Basic Attack Flow

    When started, dlg_attack reads the configuration file, and creates
100 simulated users ("SU") with the parameters listed in the user_def
section.  (Note that you can have as many user_def sections as you need
to define all your user types.  In the most complex attacks for the Idea
Forest launch, we had 11 different user_def sections in the configuration
files.)

    Each SU has a bunch of internal information, most of which is reset
when an SU completes its tasks and restarts.  This information includes:

    - If cookies are turned on, a cookie jar for storing the cookies it
      receives.
    - A simulation of the cache a browser maintains, so an SU doesn't
      request certain items over and over from the server.
    - An internal timer to wake the SU up when it's time to do something.

    As dlg_attack creates each SU, it sets the SU's internal timer to
some value less than 180 seconds (as defined by user_ramp_time).  This
makes the SUs all start at different times.  That way, both the attack
machines and the web servers avoid getting hit with a worst case scenario
of many people all hitting the same button at the same time.  (If you
wanted to simulate this situation, you could just set user_ramp_time to
0.)

    dlg_attack then starts the attack.  It waits until an SU's internal
timer goes off, and then lets that SU do what it wants to.  In this case,
the SU is going to want to request the home page for Idea Forest.
Instead of sending this request directly to the server, the SU submits it
back to dlg_attack.  The SU then sets its internal timer to go off in the
future.  The timer is set at 120 seconds (delay_time), plus some number
between -10 and 10 (delay_spread).  It then returns control to dlg_attack
so that other SUs get a chance to work.
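
    The two timer calculations described above can be sketched in a few
lines of Python.  This is only an illustration: it assumes the random
values are drawn uniformly, which this report does not state, and the
function names are invented for the example.

```python
import random


def initial_wakeup(user_ramp_time, rng=random):
    """Seconds after start at which a new SU first wakes up.

    Always less than user_ramp_time; 0 means everyone starts at once.
    """
    return rng.random() * user_ramp_time


def next_wakeup(now, delay_time, delay_spread, rng=random):
    """Time of the SU's next action: delay_time +/- up to delay_spread."""
    return now + delay_time + rng.uniform(-delay_spread, delay_spread)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
starts = [initial_wakeup(180, rng) for _ in range(100)]
second_requests = [next_wakeup(t, 120, 10, rng) for t in starts]
```

    With the example configuration, every SU makes its first request
within the first 180 seconds, and its second request between 110 and 130
seconds after that.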

    dlg_attack maintains a queue of requests from the SUs.  It's more
efficient for the requests to be sent in batches than individually, so
dlg_attack saves up the requests until one of two things happens:

    - The number of requests in the queue reaches 10 (threads_per_proc)
    - More than 0.75 seconds (queue_max_delay) have passed since the
      first request was put into the queue
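
    The two flush conditions in the list above can be sketched as a
small Python class.  This is a simplified model, not Deluge's code: the
class and method names are invented, and the caller passes in the
current time so the behavior is easy to follow.

```python
class RequestQueue:
    """Holds requests until the queue is full or its oldest entry is
    older than queue_max_delay."""

    def __init__(self, threads_per_proc=10, queue_max_delay=0.75):
        self.threads_per_proc = threads_per_proc
        self.queue_max_delay = queue_max_delay
        self.pending = []
        self.first_added_at = None

    def add(self, request, now):
        if not self.pending:
            self.first_added_at = now  # the delay clock starts here
        self.pending.append(request)

    def should_flush(self, now):
        if not self.pending:
            return False
        full = len(self.pending) >= self.threads_per_proc
        stale = now - self.first_added_at > self.queue_max_delay
        return full or stale

    def flush(self):
        """Return the batch to send and empty the queue."""
        batch, self.pending = self.pending, []
        self.first_added_at = None
        return batch


queue = RequestQueue()
queue.add("GET /", now=0.0)
too_early = queue.should_flush(now=0.5)   # one request, 0.5 s old
overdue = queue.should_flush(now=0.8)     # oldest request past 0.75 s
```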

    The queued requests are sent to the server simultaneously.  As the
response to each request is received, the data is sent to the SU that
requested it for processing.  Processing includes:

    - Logging how long it took for the response to come back
    - Logging how big (in bytes) the response is
    - Logging any errors that occurred during the request or response
    - If the SU is a robot, and if the response is some sort of markup
      (HTML or otherwise), parsing it for links to images or other pages
    - Logging whether the SU had to wait to run after its internal timer
      went off, and if so, how long.  This happens when dlg_attack is busy
      with another SU (or SUs).  This is an important number to the person
      running dlg_attack.  We'll call it the SU pause time.

    To further simulate how a real person with a real browser acts,
images are requested for a given page as soon as that page is retrieved
from the server (if that SU is allowed to request images).  In other
words, the SU doesn't wait for its internal timer to go off before
requesting the images.  Like other requests, the image requests are sent
through dlg_attack's request queue.

    This robotic SU also needs to know what page to load next time it
gets to read a page, so it picks one from the list of links on the
current page (it found these links during processing), and remembers it
for loading next time.
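
    The link handling described above might look roughly like this in
Python, using the standard html.parser module.  It is a sketch under
assumptions: only <a href> and <img src> are considered, the next page is
chosen uniformly at random, and the sample paths are made up.

```python
import random
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects page links and image links from one HTML response."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pages = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.pages.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))


def choose_next_page(html, base_url, rng=random):
    """Return (page to load next time, images to request right away)."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    next_page = rng.choice(collector.pages) if collector.pages else None
    return next_page, collector.images


page = ('<a href="/crafts/">Crafts</a> <a href="fair.html">Fair</a> '
        '<img src="/img/logo.gif">')
next_page, images = choose_next_page(page, "http://www.ideaforest.com/")
```

    The images would go into the request queue immediately, while
next_page is only remembered until the SU's timer goes off again.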

    After the SU has run for a while, it's going to complete its tasks.
In the case of our "wander_robot" user type, once it's requested 5
different pages (limit_pages_traversed), all with images, it's done.
Normally, the SU would just die at this point, but it's been told to
restart itself in this case (restartable).  So, the SU clears out all its
internal information, sets its internal timer to be 120 (restart_time)
plus some number between -10 and 10 (restart_spread), and returns control
to dlg_attack.

    Since all the SUs keep restarting, it looks like the program will
never end.  Sometimes, this is what is wanted: let the program run until
it's killed by the person running it.  In this case, however, a limit of
1800 seconds (attack_time_length) has been set.  Once dlg_attack has run
for that long, all the SUs are stopped and the program ends.


Playback Attacks

    Often simple randomly wandering robots aren't enough for a good
attack.  The SUs need to be able to use the search system, fill out forms
for registration and ordering, create an ad in the Street Fair, and so
forth.  For this, we use the recording proxy server to create a playback
attack.

    As described earlier, the proxy server records all the actions of a
real person on the site, as well as all the responses that the server
sends to those actions.  These recordings can be used in an attack.  The
user definition would look similar to this:


user_def = simple_playback         Name of user type
    attack_type = playback         Type of attack (this is NOT a robot)
    script_dir = simple            Directory holding script and responses
    script_file = _pb_             Filename of playback script
    instances = 100                How many of this user to make
    delay_time = 120               Time between pages in seconds
    delay_spread = 10              Deviation +/- seconds from delay_time
    restartable = 1                Users restart after completion
    restart_time = 120             Time before restarting in seconds
    restart_spread = 10            Deviation +/- seconds from restart_time
    accept_cookies = 1             Users are allowed to accept cookies
END


    Without going into too much detail, there are also methods for
slightly varying the playback depending on which SU is doing it.  That
way, if you have one hundred SUs running the same script (as above), and
if those SUs were (for example) registering new accounts, they wouldn't
all have the same name.  This is called variable replacement and script
hacking, and it has a large section of the documentation dedicated to it.
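
    To give a feel for the idea, here is a small Python sketch of
per-SU substitution in a recorded form submission.  Deluge's actual
variable replacement syntax is described in its documentation and is not
reproduced here; the $name placeholders below are Python's own
string.Template syntax, and the field names and values are invented.

```python
from string import Template


def personalize(recorded_body, su_index):
    """Fill values that are unique to one SU into a recorded request."""
    values = {
        "login": f"testuser{su_index:04d}",
        "email": f"testuser{su_index:04d}@example.com",
    }
    return Template(recorded_body).substitute(values)


# A recorded registration request with the variable parts marked.
recorded = "login=$login&email=$email&newsletter=yes"
bodies = [personalize(recorded, i) for i in range(3)]
for body in bodies:
    print(body)
```

    Each SU sends the same sequence of requests, but no two of them try
to register the same account.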

    One eccentricity of playback attacks that needs to be addressed is
secure transactions.  The recording proxy is incapable of recording
request/response pairs that are encrypted.  This is because by the time
the proxy sees the data, it's already been encrypted.  The proxy can't
decrypt it, because that's the whole point of security.

    The workaround is to do all recording while the servers have
encryption turned off.  Then, using another aspect of script hacking,
whichever requests need it can be turned into secure requests.


Exceptions to Basic Flow

    Unfortunately, there are certain conditions that can cause the basic
flow of dlg_attack to get trickier.

    One of these is a secure request, either from a robotic or a playback
attack.  The library that dlg_attack uses to send a batch of simultaneous
requests to the server is currently incapable of parallelizing secure
requests.  So, all secure requests have to be run singly and immediately,
ignoring the request queue.  This has the effect of slowing down the
attack; but then so does encrypting the request and decrypting the
response.

    Another issue involves pages which forward or redirect to other
pages.  Sometimes sites are designed so that requesting certain pages
will return a small page which forwards the user on to a different page,
without the user seeing anything.  From Deluge's point of view, however,
this will appear to be two separate pages.


Tuning

    Many of the parameters in the examples above seem somewhat arbitrary.
Some of these are based on what is expected from a person in the Idea
Forest demographic (spending two minutes reading a page, for example).

    However, others are less obvious.  threads_per_proc, queue_max_delay,
timeout, instances, and user_ramp_time are all parameters that can have
significant impact on the way dlg_attack affects the server.  Worse yet,
they will also affect the meaning and results of the timings reported in
the log files, and eventually in the statistics reported by the log
evaluator.  All of this is limited by the power of the attack machines,
the web servers, and the network connecting them.

    At the beginning of stress testing a site, it's important that the
person administering the attack find out the best values for these
crucial parameters.  This tuning phase is done primarily by
experimentation.  The most important thing is to not have too many SUs
running on a single attack machine.  If a machine is incapable of
handling the number of SUs it's given, either the statistics it reports
will be invalid, or the load it generates won't be what is expected, or
both.

    Because of the way SUs have timers counting down to their next
action, and how the requests from the SUs are queued by dlg_attack,
overloading a single attack machine with too many SUs will not affect
individual page response time measurements.  (This is the reason the
queueing system was designed the way it was.)  However, the load on the
servers will be lower than the configuration suggests.

    For example, suppose an attack machine running at full capacity had
the following workload:

    - 100 SUs
    - Each SU requests a page approximately every 120 seconds
    - There are about 10 new (uncached) images per page on average
    - No more than 10 simultaneous requests (threads_per_proc)

    Also, say that under this load, the web server returned full pages
(with images) in 2.3 seconds per page, on average.

    Further, say that the attacker ran an attack with all the same
parameters as above, except with 200 SUs instead of 100.

    In this case, the reported results would say that the server was
still returning pages in 2.3 seconds.  However, the CPU load on the
server and the network load would be the same as if the test ran with 100
SUs.

    This is because the attack machine can only attack as fast as it is
physically capable.  The SUs' internal timers would be going off, but the
CPU would be busy with other SUs.  SUs that wanted to do something would
start getting backed up.  In other words, the SU pause time would go
higher and higher as the test progressed.
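
    The load that a configuration asks for can be estimated with simple
arithmetic.  The Python sketch below assumes that every page view costs
one page request plus its new images, and it ignores ramp-in and restart
pauses; the function name is made up for the example.

```python
def intended_request_rate(sus, delay_time, images_per_page):
    """Requests per second the SUs would make if none were delayed."""
    requests_per_page_view = 1 + images_per_page
    return sus * requests_per_page_view / delay_time


rate_100 = intended_request_rate(100, 120, 10)
rate_200 = intended_request_rate(200, 120, 10)
print(f"100 SUs: {rate_100:.1f} requests/sec")
print(f"200 SUs: {rate_200:.1f} requests/sec")
```

    For the example above this comes to roughly 9 requests per second
for 100 SUs and 18 for 200.  If the machine is already at full capacity
at the first figure, the second is never reached, and the difference
shows up as SU pause time rather than as load on the servers.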

    The SU pause time is stored in the log file.  In the tuning phase,
the SU pause time is one of the best indicators that your attack machines
are not up to the severity of the attack that you're trying to generate.
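
    One simple way to look for this during tuning is to compare pause
times early in a run with those late in the run.  The Python sketch
below assumes the pause times have already been pulled out of the log
into a list in chronological order (the log format itself is not shown
in this report), and the factor of 2 is an arbitrary threshold chosen
for illustration.

```python
from statistics import mean


def pause_is_growing(pauses, factor=2.0):
    """True if the second half of the run shows much longer SU pause
    times than the first half."""
    if len(pauses) < 4:
        return False  # too little data to say anything
    half = len(pauses) // 2
    early = mean(pauses[:half])
    late = mean(pauses[half:])
    # The small floor keeps a near-zero early average from making
    # any tiny later pause look like a large increase.
    return late > factor * max(early, 0.001)


steady_run = [0.1, 0.1, 0.1, 0.1]
overloaded_run = [0.1, 0.2, 1.5, 3.0]
print(pause_is_growing(steady_run), pause_is_growing(overloaded_run))
```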

    Experimentation with the other crucial parameters is also very
important.  timeout and threads_per_proc affect each other, and both are
bound by the power of the attacking machine.  threads_per_proc in
particular has the ability to affect the reported results very
significantly.  In a test of the effect of threads_per_proc, these were
the results:

    threads_per_proc    Average page response time
    10                  2.3 seconds
    40                  2.7 seconds
    100                 6.6 seconds
    400                 21.8 seconds

    Since the Idea Forest stress test was Deluge's first major outing,
the significance of the effect of some of these parameters wasn't known
at the beginning.  In fact, some of these parameters didn't exist at the
start of testing, and were added later when it became obvious that more
control was necessary.  Likewise, many of the reporting systems (like SU
pause time) were not part of the software until much later in the test.
The effect of the evolution of the code will be discussed more in the
Attack and Result Data section of this document.


----------------------------------------


Metric Issues

    The specification for stress testing of the Idea Forest site called
for "4000 users for 30 minutes, and each user has an average 120 second
page read time".  Once testing started, it was discovered that there were
several different interpretations of this spec.  Instead of limiting
ourselves to a single interpretation, it was decided to try to satisfy as
many of the interpretations as possible.

    It was also noted that if we weren't able to accomplish some of the
interpretations, we should find out what was limiting us.

    Here are the interpretations, and what was accomplished:

1] 4000 active ATG sessions are created within 30 minutes.

    Deluge defines an active SU as an SU that still has requests to make.
Another way to think of this is an SU that is neither dead (because it
finished and didn't restart) nor waiting out its restart_time.

    ATG defines an active user as a user who has made a request within
the last 30 minutes.  ATG doesn't retire a user until this 30 minute
limit has been reached.  Each active ATG user incurs a memory cost even
if that user isn't truly "active".

    Purely measuring ATG sessions, we accomplished this interpretation on
the first day of testing, and were easily hitting between 6000 and 8000
sessions in successive attacks.

2] 4000 Deluge users (SUs) visit the site within 30 minutes.

    When Idea Forest initially purchased the attack machines, we had no
idea how many SUs each machine could handle.  The hope was that each of
the four machines would be able to support 1000 SUs with an average 120
second delay_time.  Each machine had a 500 MHz Celeron processor, 256 MB
of memory, and an 8.4 GB hard drive.

    A few days into testing, it was determined that each machine could
only run somewhere between 400 and 500 users.

    However, even with this limitation, since certain user types
restarted very quickly, we were able to reach 4000 Deluge SUs completing
their tasks within 30 minutes.

3] 4000 Deluge users (SUs) are concurrently active for 30 minutes.

    Because the attack machines were only able to each support 400-500
SUs with a delay_time of 120 seconds, we were only able to run about 2000
concurrent active Deluge SUs.  Major efforts were made to expand our
testing facilities by installing Linux and Deluge onto five IBM
ThinkPads.  Unfortunately, mainly because of the limited RAM on these
machines, they didn't contribute much to the attacks.

    It's important to note that we failed to meet this interpretation
only because of the limited power of the test machines.  The site itself
had no significant slowdown with 2000 Deluge SUs, and the CPU load on the
servers with that many SUs suggested that the site would still respond
well if we had been able to reach 4000 Deluge SUs.
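
    The shortfall follows directly from the per-machine capacity.  As a
quick check in Python (the function name is made up, and the calculation
assumes that capacity scales linearly with the number of machines):

```python
import math


def machines_needed(target_sus, sus_per_machine):
    """Whole machines required to run target_sus concurrent SUs."""
    return math.ceil(target_sus / sus_per_machine)


for capacity in (400, 500):
    reached = 4 * capacity
    needed = machines_needed(4000, capacity)
    print(f"{capacity} SUs/machine: 4 machines reach {reached}, "
          f"4000 SUs need {needed} machines")
```

    At 400 to 500 SUs per machine, four machines top out at 1600 to 2000
concurrent SUs, and 4000 concurrent SUs would take 8 to 10 machines of
the same kind.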


----------------------------------------


Attack and Result Data


User Types

    As discussed in the Deluge Overview section, different user types can
be defined within an attack to better simulate the people that will be
visiting the site.  Idea Forest, Fort Point, and MimEcom all provided
data in the form of demographics, scripts, and user activity patterns to
help construct the different Deluge user definitions that were used.

    This table names the different user types, and gives some information
about how they each affect the site.

Name         Type      Pages    Description                 Stress on site
====         ====      =====    ===========                 ==============

Surfer       Robot     1        Load top page               Web server

Browser      Robot     5        Start at top, wander to     Web server
                                4 other pages

Searcher     Playback  6        Start at top, use search    Web server
                                engine to get to 5 other    Search engine
                                pages

Browse/Buy   Playback  10       Start at top, wander to     Web server
                                products, buy               Order system
                                                            User & product DBs

Search/Buy   Playback  8        Start at top, search to     Web server
                                products, buy               Search engine
                                                            Order system
                                                            User & product DBs

Register     Playback  5        Start at top, register      Web server
                                new account                 User database

Street_Read  Playback  6        Start at top, read Street   Web server
                                Fair posts                  Street Fair code/DB

Street_Post  Playback  8        Start at top, post a        Web server
                                Street Fair ad              Street Fair code/DB


    Note that the users employing playback scripts often used variable
replacement within the scripts to make sure they were all doing slightly
different things.  Also, in some cases, multiple scripts were recorded
for a single user type to further vary the actions.  For example, the
Browse/Buy user type was actually made up of several different user_defs,
each with a different playback script.

    The number of SUs of each type and their page read delay varied from
attack to attack, depending on what kind of stress we were applying to
the site.  Cookies and images were likewise allowed or disallowed as the
attack required.


Attacks

    Many different types of attacks were run, as shown in the table
below.  While the primary attacks were designed to match the expected
user cross-section on the site, some tests were tailored to stress
specific areas of the site.


Attack                Description                  User Types       Images
======                ===========                  ==========       ======

Registration Stress   Register new accounts.       Register         No

Street Fair Stress    Post and read Street         Street_Read      No
                      Fair ads.                    Street_Post

Cross Section         A cross section of all       All              Yes
                      user types, with the number
                      of SUs of each type
                      distributed according to
                      anticipated live load.

No Forms              All the user types that      Surfer           Yes
                      don't fill out forms.        Browser
                                                   Searcher


    There were many more tests than these; the results of a
representative sample are reported here.  Other runs included tests to
stress the site while Fort Point or MimEcom tracked down bugs or looked
for problems, runs to test and tune Deluge, and other maintenance
related runs.


----------------------------------------


Attacks - Registration Stress

User Types          Instances    delay_time
    Register        400          15

timeout             15 sec
threads_per_proc    100
queue_max_delay     0.75 sec
user_ramp_time      20 sec

NOTE TO READERS:  Graphs not available in text version, only in Microsoft
                  Word version.

Response Code            Count
1xx - Informational      0
2xx - Success            17137
3xx - Redirect           11124
4xx - Incomplete         1
5xx - Server Error       6

Comments:
    - There are far fewer SUs than normal, but since delay_time is set so
      low, it's possible to create just as many accounts with less memory
      usage inside Deluge.
    - The Idea Forest site has a lot of redirect pages.  This test was
      run before Deluge was modified to treat redirects as the same page.


----------------------------------------


Attacks - Street Fair Stress

User Types         Instances    delay_time
    Street_Read    320          12
    Street_Post    80           12

timeout            15 sec
threads_per_proc   100
queue_max_delay    0.75 sec
user_ramp_time     20 sec

NOTE TO READERS:  Graphs not available in text version, only in Microsoft
                  Word version.

Response Code            Count
1xx - Informational      0
2xx - Success            63453
3xx - Redirect           3918
4xx - Incomplete         241
5xx - Server Error       2

Comments:
    - There are far fewer SUs than normal, but since delay_time is set so
      low, it's possible to generate just as much load with less memory
      usage inside Deluge.
    - The Idea Forest site has a lot of redirect pages. This test was run
      before Deluge was modified to treat redirects as the same page.


----------------------------------------


Attacks - Cross Section

User Types         Instances    delay_time
    Surfer         1700         120
    Browser        800          120
    Searcher       800          120
    Browse/Buy     170          120
    Search/Buy     170          120
    Register       160          120
    Street_Read    160          120
    Street_Post    40           120

timeout            30 sec
threads_per_proc   100
queue_max_delay    0.75 sec
user_ramp_time     180 sec

NOTE TO READERS:  Graphs not available in text version, only in Microsoft
                  Word version.

Response Code            Count
1xx - Informational      0
2xx - Success            45673
3xx - Redirect           536
4xx - Incomplete         466
5xx - Server Error       0

Comments:
    - This test was run before the queuing method had been worked out,
      and also before the significance of threads_per_proc was known.  The
      graphs show how much this can affect URL and page load times.


----------------------------------------


Attacks - No Forms

User Types         Instances    delay_time
    Surfer         948          120
    Browser        484          120
    Searcher       568          120

timeout            30 sec
threads_per_proc   10
queue_max_delay    0.6 sec
user_ramp_time     180 sec

NOTE TO READERS:  Graphs not available in text version, only in Microsoft
                  Word version.

Response Code            Count
1xx - Informational      0
2xx - Success            106345
3xx - Redirect           710
4xx - Incomplete         3258
5xx - Server Error       0

Comments:
    - The high number of 4xx response codes is made up entirely of 404
      errors for missing images.  This test was run on 8/8/2000, and
      shows that some product images were still missing at that time.


----------------------------------------


Suggestions for Future Testing


Deluge Issues

    Deluge needs to be able to show more real-time data, instead of doing
all log parsing as a post process.  Since test machines are so cheap, it
would probably be best to dedicate a machine to this effort; it's of
utmost importance that the attack machines are not burdened with
anything but dlg_attack.

    The real time reporting machine could also be used to control and
adjust the attack during a run.  This would also require some changes to
Deluge, as it currently doesn't allow for much alteration of an attack
once it's started.

    There are a large number of other improvements that Deluge can use.
For the most part, they're all kept in a TODO file in the Deluge
distribution.


Hardware Issues

    Now that a general idea of the hardware horsepower needed for a
certain level of attack is known, it should be easier to construct a
portable rack for testing multiple sites.  A rack of eleven machines (10
for attacking, 1 for control and reporting), each slightly faster than
the ones used in this test, is recommended for testing a site with an
expected load and demographic equivalent to Idea Forest's.

    When the site appeared to be slow, it was sometimes hard to isolate
where the problem was occurring.  This was often due to the lack of
knowledge of, and access to, Level 3's network (through no fault of Idea
Forest, Fort Point, or MimEcom).  For the next test, a network engineer
from the ISP should probably be involved in the testing.


Attack Issues

    As described in the "Tuning" section, it's very important that some
time at the beginning of the testing phase be dedicated to tuning Deluge
for the attack.  Now that Deluge is more mature and stable than it was at
the beginning of this test, tuning should be much easier in the future.


