Test Design

The following guidelines are currently PROPOSED and being discussed on the development mailing list dev@harmony.apache.org. Please direct comments and questions there.

Stress tests are built from simple building blocks according to configuration strings.
Tests have a JUnit interface.
[Case study] Suppose someone commits tests to SVN that implement a different test interface. To reuse them, we can add another generator that converts these tests to the JUnit interface.
The list of configuration strings is maintained manually. If we plan to use the JUnit runner to launch a sequence of stress tests, the most straightforward model is to wrap the configuration strings in JUnit test cases and put the documentation into the javadoc for these test cases.
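
As an illustration of building a stress test from a configuration string, here is a minimal sketch. The proposal does not define a configuration syntax, so the "name*repeat" tokens separated by ";", the class name StressTestBuilder, and the block names in the usage note are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the "name*repeat;name*repeat" syntax is an assumption,
// not something defined by the proposal.
final class StressTestBuilder {
    // Registered building blocks, looked up by name.
    private final Map<String, Runnable> blocks = new LinkedHashMap<>();

    void register(String name, Runnable block) {
        blocks.put(name, block);
    }

    /** Expands a configuration string such as "alloc*3;sort" into a flat list of steps. */
    List<Runnable> parse(String config) {
        List<Runnable> steps = new ArrayList<>();
        for (String token : config.split(";")) {
            String[] parts = token.trim().split("\\*");
            String name = parts[0].trim();
            Runnable block = blocks.get(name);
            if (block == null) {
                throw new IllegalArgumentException("Unknown building block: " + name);
            }
            // A missing repeat count means the block runs once.
            int repeat = parts.length > 1 ? Integer.parseInt(parts[1].trim()) : 1;
            for (int i = 0; i < repeat; i++) {
                steps.add(block);
            }
        }
        return steps;
    }

    /** Builds a single runnable test that executes all steps in order. */
    Runnable build(String config) {
        List<Runnable> steps = parse(config);
        return () -> steps.forEach(Runnable::run);
    }
}
```

Under the JUnit model described above, each configuration string could become one test method whose body calls build("...").run(), with the javadoc of that method documenting what the string exercises.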

Further Steps

Stress tests are expected to generate relevant bugs. Since behavior under stress is usually unspecified, we need to introduce something measurable instead of a pass/fail result for the stress tests. See the comparative approach below.
Everyone should create tests and run them against Harmony VM and RI. This would be a real-life test of our approach.

Comparative Approach

The simplest example of the comparative approach is the following.

Tester: My test fails on Harmony VM and passes on RI. Please fix Harmony VM.

This usually does not work for stress tests.

Developer: Who told you that OutOfMemoryError should be thrown in your thread? My finalizer thread is just a normal Java thread, like yours, and it can fail as well. You have a bug in your test.

There are multiple reasons why we will always have such bugs in the tests.

These bugs keep showing up, and fixing all of them regularly takes too much time.
Stress testing reuses tests that were usually not designed for stress execution, for example, multithreaded execution.
These bugs depend on the internal structure of the VM. Test authors do not possess sufficient knowledge of the problem and the structure.
Sometimes Java is not rich enough.

How can we have a maintainable test product, taking all these limitations into account? We need to learn how to live with occasional failures of the stress tests. This means that, instead of simply failing, a test should report how well it does on Harmony VM compared to RI:

Failures with the worst relative metric can be evaluated first
We can detect that a relative metric for a test worsened on the recent build

Developers are more easily convinced to fix "the worst issue" or "a degradation" than "some issue".
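
The ranking and degradation checks described above can be sketched as follows. This is only an illustration under stated assumptions: it uses pass rate as the metric, defines the relative metric as the Harmony pass rate divided by the RI pass rate, and the names ComparativeReport, Result, worstFirst, and degraded are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: the metric definition and all names are assumptions.
final class ComparativeReport {

    /** Pass rates (0.0 to 1.0) of one test on both VMs. */
    static final class Result {
        final String test;
        final double harmonyPassRate;
        final double riPassRate;

        Result(String test, double harmonyPassRate, double riPassRate) {
            this.test = test;
            this.harmonyPassRate = harmonyPassRate;
            this.riPassRate = riPassRate;
        }

        /** Relative metric: 1.0 means parity with RI, lower means worse on Harmony. */
        double relative() {
            // A test that never passes on RI gives no basis for comparison,
            // so it is not ranked as a Harmony problem.
            return riPassRate == 0.0 ? 1.0 : harmonyPassRate / riPassRate;
        }
    }

    /** Returns a copy of the results ordered so the worst relative metric comes first. */
    static List<Result> worstFirst(List<Result> results) {
        List<Result> sorted = new ArrayList<>(results);
        sorted.sort(Comparator.comparingDouble(Result::relative));
        return sorted;
    }

    /** True if the relative metric dropped by more than the tolerance between two builds. */
    static boolean degraded(Result previousBuild, Result currentBuild, double tolerance) {
        return currentBuild.relative() < previousBuild.relative() - tolerance;
    }
}
```

The tolerance parameter is there because occasional failures are expected; without it, ordinary run-to-run noise would be reported as a degradation.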

Several metrics can be collected for each test:

Pass rate: assuming the test is 100% reliable on RI, we can calculate the percentage of failures on Harmony VM.
Number of times the test can be executed sequentially before a failure.
Memory consumption: a generator can preallocate more and more memory before launching the test in a loop.
Max threads supported: a generator can exponentially increase the number of threads launching the test in parallel.
Execution time: all this apparatus is quite close to performance testing methodology. There is no need to compete with performance tests in their own field, though.
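
Two of the metrics above, the number of sequential runs before a failure and the maximum number of threads supported, can be measured by small generators along these lines. This is a sketch, not a proposed implementation: the class and method names are hypothetical, and it assumes that any Throwable escaping the test counts as a failure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: names and the failure criterion are assumptions.
final class StressMetrics {

    /** Counts successful sequential runs before the first failure, up to the given limit. */
    static int runsBeforeFailure(Runnable test, int limit) {
        for (int i = 0; i < limit; i++) {
            try {
                test.run();
            } catch (Throwable t) {
                return i;
            }
        }
        return limit;
    }

    /** Doubles the thread count until a run fails; returns the largest count that passed. */
    static int maxThreadsSupported(Runnable test, int maxThreads) {
        int supported = 0;
        // "n > 0" stops the loop if doubling overflows int.
        for (int n = 1; n > 0 && n <= maxThreads; n *= 2) {
            if (!runInParallel(test, n)) {
                break;
            }
            supported = n;
        }
        return supported;
    }

    /** Runs the test once in each of the given number of threads; true if none failed. */
    private static boolean runInParallel(Runnable test, int threads) {
        AtomicBoolean failed = new AtomicBoolean(false);
        List<Thread> workers = new ArrayList<>();
        try {
            for (int i = 0; i < threads; i++) {
                Thread t = new Thread(() -> {
                    try {
                        test.run();
                    } catch (Throwable e) {
                        failed.set(true);
                    }
                });
                workers.add(t);
                t.start();
            }
        } catch (Throwable e) {
            // Thread creation itself can fail under stress, for example with OutOfMemoryError.
            failed.set(true);
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                failed.set(true);
            }
        }
        return !failed.get();
    }
}
```

Running the same generators on RI and on Harmony VM gives two numbers per test, which is what the comparative approach needs.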