net.sourceforge.apphere.util.html
Class HtmlStreamTokenizer

java.lang.Object
  extended by net.sourceforge.apphere.util.html.HtmlStreamTokenizer

public class HtmlStreamTokenizer
extends java.lang.Object

HtmlStreamTokenizer is an HTML tokenizer modeled on the java.io.StreamTokenizer class but specialized for HTML streams. It splits the input into tags, text, and comments, which is useful when you need to parse the structure of an HTML document.

 import adc.parser.*;

 HtmlStreamTokenizer tok = new HtmlStreamTokenizer(inputstream);
 HtmlTag tag = new HtmlTag();
 while (tok.nextToken() != HtmlStreamTokenizer.TT_EOF) {
     int ttype = tok.getTokenType();
     if (ttype == HtmlStreamTokenizer.TT_TAG) {
         tok.parseTag(tok.getStringValue(), tag);
         System.out.println("tag: " + tag.toString());
     } else if (ttype == HtmlStreamTokenizer.TT_TEXT) {
         System.out.println("text: " + tok.getStringValue());
     } else if (ttype == HtmlStreamTokenizer.TT_COMMENT) {
         System.out.println("comment: <!--" + tok.getStringValue() + "-->");
     }
 }

One of the motivations for designing parseTag() to take an HtmlTag argument rather than having parseTag() return a newly created HtmlTag is so you can create your own tag class derived from HtmlTag.
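The design choice described above can be sketched with stand-in classes. This is only an illustration: SketchTag, CountingTag, and parseInto are hypothetical names invented here, not the real HtmlTag class or parseTag() method, and the sketch extracts only a tag name and an end-tag flag. It tolerates input with or without the surrounding angle brackets, since this page does not say which form getStringValue() returns.

```java
// Hypothetical stand-in for a tag class; NOT the real HtmlTag.
class SketchTag {
    private String name = "";
    private boolean endTag;

    void reset() { name = ""; endTag = false; }
    void setName(String name) { this.name = name; }
    void setEndTag(boolean endTag) { this.endTag = endTag; }
    String getName() { return name; }
    boolean isEndTag() { return endTag; }
}

// A caller-defined subclass that adds its own bookkeeping.
class CountingTag extends SketchTag {
    private int parses;

    @Override
    void reset() { super.reset(); parses++; }

    int getParseCount() { return parses; }
}

class ParseIntoSketch {
    // Fills the caller-supplied tag instead of allocating a new one, so the
    // caller picks the concrete class and can reuse the same object.
    static void parseInto(StringBuffer sbuf, SketchTag tag) {
        tag.reset();
        String s = sbuf.toString().trim();
        if (s.startsWith("<")) s = s.substring(1);
        if (s.endsWith(">")) s = s.substring(0, s.length() - 1);
        s = s.trim();
        if (s.startsWith("/")) {
            tag.setEndTag(true);
            s = s.substring(1).trim();
        }
        // The tag name runs up to the first white space character.
        int end = 0;
        while (end < s.length() && !Character.isWhitespace(s.charAt(end))) end++;
        tag.setName(s.substring(0, end).toLowerCase(java.util.Locale.ROOT));
    }

    public static void main(String[] args) {
        CountingTag tag = new CountingTag();
        parseInto(new StringBuffer("<A HREF=\"x.html\">"), tag);
        parseInto(new StringBuffer("</A>"), tag);
        System.out.println(tag.getName() + " " + tag.isEndTag() + " " + tag.getParseCount());
    }
}
```

Had parseInto returned a freshly created SketchTag, the caller would always receive the base class and the subclass bookkeeping would be lost; passing the object in avoids that and also saves one allocation per tag.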

Version:
2.01 09/12/97
Author:
Arthur Do
See Also:
adc.parser.HtmlTag, adc.parser.Table

Field Summary
static int TT_COMMENT
          comment token.
static int TT_EOF
          end of stream.
static int TT_TAG
          tag token.
static int TT_TEXT
          text token.
 
Constructor Summary
HtmlStreamTokenizer(java.io.InputStream in)
           
HtmlStreamTokenizer(java.io.Reader in)
           
 
Method Summary
 int getLineNumber()
           
 java.lang.StringBuffer getStringValue()
           
 int getTokenType()
           
 java.lang.StringBuffer getWhiteSpace()
           
 int nextToken()
           
static void parseTag(java.lang.StringBuffer sbuf, HtmlTag tag)
          The reason this function takes an HtmlTag argument rather than returning a newly created HtmlTag object is so that you can create your own tag class derived from HtmlTag if desired.
static java.lang.String unescape(java.lang.String buf)
          Replaces HTML escape sequences with their character equivalents, e.g. &copy; becomes ©.
static void unescape(java.lang.StringBuffer buf)
          Replaces HTML escape sequences with their character equivalents, e.g. &copy; becomes ©.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

TT_EOF

public static final int TT_EOF
end of stream.

See Also:
Constant Field Values

TT_TEXT

public static final int TT_TEXT
text token.

See Also:
Constant Field Values

TT_TAG

public static final int TT_TAG
tag token.

See Also:
Constant Field Values

TT_COMMENT

public static final int TT_COMMENT
comment token.

See Also:
Constant Field Values
Constructor Detail

HtmlStreamTokenizer

public HtmlStreamTokenizer(java.io.Reader in)
Parameters:
in - input reader

HtmlStreamTokenizer

public HtmlStreamTokenizer(java.io.InputStream in)
Parameters:
in - input stream
Method Detail

getTokenType

public final int getTokenType()
Returns:
the token type, one of the TT_ constants

getStringValue

public final java.lang.StringBuffer getStringValue()
Returns:
string value of the token

getWhiteSpace

public final java.lang.StringBuffer getWhiteSpace()
Returns:
any white space accumulated since the last call to nextToken()

getLineNumber

public int getLineNumber()
Returns:
the current line number. The line number is incremented each time nextToken() reads a newline character ('\n').
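The counting rule can be sketched as follows. This is a minimal illustration, not the class's own code: lineNumberAfter is a hypothetical helper, and it assumes numbering starts at 1, which this page does not state. Only '\n' advances the count, so "\r\n" counts once and a lone '\r' does not count at all.

```java
class LineCountSketch {
    // Hypothetical helper: returns the line number reached after scanning
    // the given text, starting at 1 (an assumption) and adding one for
    // every '\n' character seen.
    static int lineNumberAfter(CharSequence text) {
        int line = 1;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                line++;
            }
        }
        return line;
    }

    public static void main(String[] args) {
        System.out.println(lineNumberAfter("<p>\nhello\r\n</p>"));
    }
}
```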

nextToken

public int nextToken()
              throws java.io.IOException
Returns:
the type of the next token, one of the TT_ constants
Throws:
java.io.IOException - if an error occurs while reading the input stream.

parseTag

public static void parseTag(java.lang.StringBuffer sbuf,
                            HtmlTag tag)
                     throws HtmlException
The reason this function takes an HtmlTag argument rather than returning a newly created HtmlTag object is so that you can create your own tag class derived from HtmlTag if desired.

Parameters:
sbuf - text buffer to parse
tag - parse the text buffer and store the result in this object
Throws:
HtmlException - if the tag is malformed.

unescape

public static java.lang.String unescape(java.lang.String buf)
Replaces HTML escape sequences with their character equivalents, e.g. &copy; becomes ©.

Parameters:
buf - text buffer to unescape
Returns:
a string with all HTML escape sequences replaced by their character equivalents

unescape

public static void unescape(java.lang.StringBuffer buf)
Replaces HTML escape sequences with their character equivalents, e.g. &copy; becomes ©.

Parameters:
buf - buffer whose HTML escape sequences are replaced in place
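The two overloads can be pictured with the sketch below. It is an assumption-laden illustration, not the library's implementation: UnescapeSketch and its five-entry entity table are hypothetical, decimal numeric references such as &#169; are handled only for codes up to 0xFFFF, and unknown sequences are left untouched. The real method may recognize more entities or treat malformed input differently.

```java
import java.util.HashMap;
import java.util.Map;

class UnescapeSketch {
    // Small illustrative entity table; assumed here, not taken from the library.
    private static final Map<String, Character> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", '&');
        NAMED.put("lt", '<');
        NAMED.put("gt", '>');
        NAMED.put("quot", '"');
        NAMED.put("copy", '\u00A9');
    }

    // String overload: returns a new string and leaves the argument alone.
    static String unescape(String buf) {
        StringBuilder out = new StringBuilder(buf.length());
        int i = 0;
        while (i < buf.length()) {
            char c = buf.charAt(i);
            // Look for "&body;" starting at position i.
            int semi = (c == '&') ? buf.indexOf(';', i + 1) : -1;
            Character repl = (semi > i + 1) ? decode(buf.substring(i + 1, semi)) : null;
            if (repl != null) {
                out.append(repl.charValue());
                i = semi + 1;
            } else {
                out.append(c); // not a known escape: copy the character as is
                i++;
            }
        }
        return out.toString();
    }

    // StringBuffer overload: rewrites the caller's buffer in place.
    static void unescape(StringBuffer buf) {
        String result = unescape(buf.toString());
        buf.setLength(0);
        buf.append(result);
    }

    // Maps an entity body such as "copy" or "#169" to a character,
    // or returns null when the body is not recognized.
    private static Character decode(String body) {
        if (body.startsWith("#")) {
            try {
                int code = Integer.parseInt(body.substring(1));
                return (code >= 0 && code <= 0xFFFF) ? Character.valueOf((char) code) : null;
            } catch (NumberFormatException e) {
                return null;
            }
        }
        return NAMED.get(body);
    }

    public static void main(String[] args) {
        System.out.println(unescape("&copy; 1997 A &amp; B"));
        StringBuffer sb = new StringBuffer("x &lt; y");
        unescape(sb);
        System.out.println(sb);
    }
}
```

The String version suits one-off conversions, while the StringBuffer version fits the tokenizer loop, where getStringValue() already hands back a StringBuffer that can be modified directly.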