Apache Harmony is retired at the Apache Software Foundation since Nov 16, 2011.

The information on these pages may be out of date, or may refer to resources that have moved or have been made read-only.
For more information please refer to the Apache Attic

Design of the Regex Processing Framework

Regex Processing Framework

  1. Revision History
  2. About This Document
    1. Purpose
    2. Intended Audience
    3. Documentation Conventions
  3. Introduction to Pattern Matching
  4. AbstractSet Interface
    1. Characteristics
    2. Methods Used
    3. Class Hierarchy
    4. Optimizations
  5. Usage Examples
  6. References

Revision History

Version Version Information Date
Initial version Nadya Morozova, Nikolay Kuznetsov: document created. December 16, 2005
Formatting update Nadya Morozova September 21, 2006

About This Document

Purpose

This document describes the java.util.regex package delivered as part of the Harmony classlibrary project. This document provides an overview of the overall architecture with focus on the aspects improving performance.

Intended Audience

The target audience for the document includes a wide community of engineers interested in using regular expression package and in further work with the product to contribute to its development. The document assumes that readers are familiar with Regular expressions [1, 2, 3], finite automata theory [4], basic compiler techniques [4] and the Java* programming language [5].

Documentation Conventions

This document uses the unified conventions for the Harmony documentation kit.

Back to Top

Introduction to Pattern Matching

To analyze text in search of sequences matching preset patterns, you can choose from a variety of techniques, such as the wide-spread regular expressions (RE) [1], exact string matching, for example, the Boyer-Moore algorithm (BM) [6], and others. However, the RE engine is rather complex, and significantly impacts performance, whereas the exact string matching techniques have a limited application.

For example, the regular expression .*world is used to find the last instance of world in a sentence. The BM string searching algorithm is more appropriate for solving the task than the RE technique, and is more effective from the performance perspective. To optimize using pattern matching techniques with different tasks, Harmony provides a unified interface that applies to any part of a regular expression, including the whole expression.

In terms of the Finite Automata theory, which is the basis of regular expression processing, a part of regular expression is a node. The Harmony regex framework treats every distinctive part of a regular expression and the whole expression as a node. Each node implements the unified interface adjusted for a specific technique. For instance, for the regular expression in the example, the Harmony framework includes a special class SequenceSet, which has a unified interface called AbstractSet, and implements the Boyer-Moore algorithm in searching for a word.

Back to Top

AbstractSet Interface

The key feature of the Harmony regex framework the single super interface, AbstractSet, shared by all types of nodes. Individual nodes are independent, so that every node of the automaton incorporates its own find-match algorithm optimized for a specific type of nodes. You can use methods implemented in the AbstractSet interface subclasses or override these methods to use your own more efficient algorithm.

Characteristics

The AbstractSet interface has the following key characteristics:

Back to Top

Methods Used

The AbstractSet interface defines the matching and finding methods, and the service methods supporting them, as follows.

Service Methods

accept()
Checks whether the current node matches the current string symbol(s) to ensure that the automaton can proceed with the current state. The method returns the number of accepted symbols. This method is designed for nodes representing tokens, which mostly use the default match procedure. To optimize the match procedure for a specific type of token, you only need to override the accept() method, see Example 1.
first()
Checks whether the first symbol of the current node intersects with previous one. If not, then the previous node is quantified possessively without backtrack (the option of restarting the search from the previous position).
next()
Returns the next node of the automaton.

Matching and Finding Methods

matches()
Runs the match procedure starting from the current state. This is the only method you need to override in order to introduce new functionality. By default, the matches() method does the following when working with terms (see Class Hierarchy):
  1. Calls the accept() method and proceeds if the method returns a positive value.
  2. Calls the match of the next node with the shift obtained from the accept() method.
  3. Returns TRUE in case of a successful match.
find()
Checks whether the pattern can match any part of the string in the following way:
  1. Finds the next position in the input string, for which the accept() method returns a positive value. If no matches are found, the method terminates and returns a negative value.
  2. From this position, runs the matches() method.
findBack()
Does the same as the find() method but starts from the end of the string or from the nearest new-line symbol. This method optimizes the find procedure when the current node of the pattern fits the rest of the string. In such cases, it is necessary to find the last occurrence of the pattern represented by the next node, as in the case of the .*world pattern.

Back to Top

Class Hierarchy

Figure 1 shows the class hierarchy based on the AbstractSet interface. As mentioned in its description, the whole hierarchy is based on this interface. The other classes of the framework are divided into the following categories:

Hierarchy of classes in regexp framework

Figure 1. Class Hierarchy

Figure 1 displays only the basics of the regex framework. This framework also includes other nodes responsible for different optimizations. For more details, see the comments inlined in code.

Back to Top

Optimizations

In the current implementation, most optimizations are based on the node representation to improve efficiency of the framework. It is noteworthy that an optimal finite automaton is built during compile time, so that the constructed automaton does not spend additional time on decision-making overhead during match time. The regex framework optimizations improve different aspects, such as:

Back to Top

Usage Examples

This part on the document illustrates usage of the Harmony regex framework.

Example 1

This example illustrates using the CharSet class, which includes all nodes accepting characters to create a new type of tokens. For that, the class uses the LeafSet class, which is the basic class for tokens. This example shows that the only method you need to override in order to present a new type of tokens is the accept() method.

class CharSet extends LeafSet {
    ...
    public int accept(int strIndex, CharSequence testString) {
        // checks that the current symbol of the input string is the one this 
        // instance represents and returns 1 (the number of
        // accepted symbols) or -1 if accept fails:
        return (this.ch == testString.charAt(strIndex)) ? 1 : -1;
    }
    ...
}

Example 2

The following example demonstrates independent implementation of the find() method for SequenceSet.

Note

This changes the find procedure for nodes representing character sequences and at the same time does not affect the find-match algorithms of other types of nodes.

class SequenceSet extends LeafSet {
    ...
    protected int indexOf(CharSequence str, int from) {
        // Boyer-Moore algorithm 
        ...
    }
    
    public int find(int strIndex, CharSequence testString,
            MatchResultImpl matchResult) {
        ...
        while (strIndex <= strLength) {
            // call the fast search method instead of default implementation
            strIndex = indexOf(testStr, strIndex);
            if (strIndex < 0) {
                return -1;
            }
            if (next.matches(strIndex + charCount, testString, matchResult) >= 0) {
                return strIndex;
            }
            strIndex++;
        }
        return -1;
    }
    ...
}

Example 3

This example illustrates how to turn the match procedure of the .* patterns into the find procedure, which is potentially faster.

class DotQuantifierSet extends LeafQuantifierSet {
    ...     
    public int matches(int stringIndex, CharSequence testString,
            MatchResultImpl matchResult) {
        ...
        // find line terminator, since .* represented by this node accepts all characters
        // except line terminators 
        int startSearch = findLineTerminator(stringIndex, strLength, testString);
        ...
        // run findBack method of the next node, because the previous 
        // part of the string is accepted by .* and no matter where
        // the next part of pattern is found, the procedure works OK.  
        return next.findBack(stringIndex, startSearch, testString, matchResult);
    }
}

Back to Top

References

[1] Jeffrey E. F. Friedl., Mastering regular expressions, 2nd Edition., July 2002, O'Reilly, ISBN 0-596-00289-0

[2] McNaughton, R. and Yamada, H. Regular Expressions and State Graphs for Automata, IRA Trans. on Electronic Computers, Vol. EC-9, No. 1, Mar. 1960, pp 39-47.

[3] Thompson, K., Regular Expression search Algorithm, Communication ACM 11:6 (1968), pp. 419-422.

[4] Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman, Compilers, Principles Techniques and Tools, Addison-Wesley Publishing Company, Inc., 1985, ISBN 0-201-10088-6

[5] Java Technology site, http://java.sun.com

[6] R. Boyer and S. Moore. A fast string searching algorithm. C. ACM, 20:762-772, 1977.

[7] D.E. Knuth, .I. Morris, and V. Pratt. Fast pattern matching in strings. SIAM J on Computing, 6:323-350, 1977.

Back to Top

* Other brands and names are the property of their respective owners.