This section of the archives stores flipcode's complete Developer Toolbox collection, featuring a variety of mini-articles and source code contributions from our readers.

 

  Tokenizer Class
  Submitted by



This is a tokenizer class much like StreamTokenizer in the Java API. You instantiate it by wrapping a std::istream and then simply call nextToken() to get token after token. The tokenizer is table driven and easy to configure. By setting a couple of flags you can have /* */ and // comments skipped automatically, much like the Java class. Other features include a line counter, quoted strings, number tokens and more. Note that I put this one together last night; it seems to work perfectly for my current project (where I parse Q3-style shader scripts, by the way :), but I haven't done any serious testing, so there may be a couple of bugs. Anyway, the code is free to use. Best regards,
Anders Pistol
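
As a rough illustration of what "table driven" means here, the sketch below keeps one byte of bit flags per character code and classifies characters purely by lookup. It is not the submitted code: CharTable, makeDefaultTable, classify and the k* flag names are made up for this example, and the default ranges are an assumption chosen to resemble a typical configuration.

```cpp
#include <cassert>
#include <string>

// Hypothetical flag names, playing the same role as CT_* bit flags.
enum : unsigned char { kWhitespace = 1, kDigit = 2, kAlpha = 4, kQuote = 8 };

// One byte of flags per character code; zero means "ordinary".
struct CharTable {
    unsigned char type[256] = {};

    // OR a flag into every entry in [low, hi], clamped to the table bounds.
    void mark(int low, int hi, unsigned char flag) {
        if (low < 0) low = 0;
        if (hi > 255) hi = 255;
        for (int c = low; c <= hi; ++c) type[c] |= flag;
    }

    // Look up a character; codes outside the table (such as EOF) have no flags.
    bool has(int c, unsigned char flag) const {
        return c >= 0 && c <= 255 && (type[c] & flag) != 0;
    }
};

// Build a small example configuration (assumed, for illustration only).
inline CharTable makeDefaultTable() {
    CharTable t;
    t.mark('a', 'z', kAlpha);
    t.mark('A', 'Z', kAlpha);
    t.mark('_', '_', kAlpha);
    t.mark('0', '9', kDigit);
    t.mark('-', '-', kDigit);
    t.mark('"', '"', kQuote);
    t.mark(0, ' ', kWhitespace);
    return t;
}

// Name the class of a character using only table lookups.
inline std::string classify(const CharTable& t, int c) {
    if (t.has(c, kWhitespace)) return "whitespace";
    if (t.has(c, kDigit))      return "digit";
    if (t.has(c, kAlpha))      return "word";
    if (t.has(c, kQuote))      return "quote";
    return "ordinary";
}
```

Reconfiguring such a tokenizer then amounts to marking or clearing ranges in the table, for example calling t.mark('$', '$', kAlpha) to let '$' appear in words, without touching the scanning loop.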


Currently browsing [tokenizer.zip] (3,234 bytes) - [tokenizer.h] - (4,665 bytes)

#ifndef _Tokenizer_H
#define _Tokenizer_H

#include <iostream>
#include <string>

#ifndef NCHAR
#    define NCHAR 255
#endif

// Character class flags stored in the syntax table.
#define CT_WHITESPACE   1
#define CT_DIGIT        2
#define CT_ALPHA        4
#define CT_QUOTE        8
#define CT_COMMENT      16

// Token types returned by nextToken().
#define TT_EOF          -1
#define TT_EOL          '\n'
#define TT_NUMBER       -2
#define TT_WORD         -3
#define TT_STRING       -4

/**
 * The tokenizer class takes an input stream and parses it
 * into "tokens", allowing the tokens to be read one at a time.
 * The parsing process is controlled by a table and a number
 * of flags that can be set to various states. The tokenizer
 * can recognize identifiers, numbers, quoted strings and
 * various comment styles.
 */
class Tokenizer
{
private:
    std::istream& m_input;              /**< Reference to input stream. */
    unsigned char m_chType[NCHAR + 1];  /**< Table where each character's type is represented. */
    bool m_eolIsSignificant;            /**< Flag if eol is significant, i.e. will return a token. */
    int m_lineNo;                       /**< Current line number. */
    bool m_lowerCaseMode;               /**< If word tokens are to be forced to lower case. */
    bool m_pushedBack;                  /**< If pushBack() has been called. */
    bool m_slslComments;                /**< If slash-slash comments are enabled. */
    bool m_slstComments;                /**< If slash-star comments are enabled. */
    int m_ttype;                        /**< Token type. */
    double m_nval;                      /**< Numeric value. */
    std::string m_sval;                 /**< String value. */

public:
    /**
     * Initialize tokenizer by attaching an input
     * stream. Also resets the tokenizer to default.
     * @param input Reference to input stream.
     */
    Tokenizer(std::istream& input);

    /**
     * Virtual destructor so if this class is used as
     * a base class, the destructors are called in
     * proper order.
     */
    virtual ~Tokenizer();

    /**
     * Specifies which character starts a single-line comment.
     * @param ch Single-line comment character.
     */
    void commentChar(int ch);

    /**
     * Determines whether or not ends of line are treated
     * as tokens.
     * @param flag True if eol are significant.
     */
    void eolIsSignificant(bool flag);

    /**
     * Return the current line number.
     * @return Line number.
     */
    int lineno(void);

    /**
     * Determines whether or not word tokens are
     * automatically lowercased.
     * @param flag True if tokens shall be lowercased.
     */
    void lowerCaseMode(bool flag);

    /**
     * Parses the next token from the input stream
     * of this tokenizer.
     * @return The value of ttype member.
     */
    int nextToken(void);

    /**
     * Specifies that the character argument is "ordinary"
     * in this tokenizer.
     * @param ch Character.
     */
    void ordinaryChar(int ch);

    /**
     * Specifies that all characters c in the range
     * low <= c <= hi are "ordinary" in this tokenizer.
     * @param low Low range.
     * @param hi High range.
     */
    void ordinaryChars(int low, int hi);

    /**
     * Specifies that numbers should be parsed
     * by this tokenizer.
     */
    void parseNumbers(void);

    /**
     * Causes the next call to the nextToken method of this
     * tokenizer to return the current value in the ttype member,
     * and not to modify the value in nval or sval fields.
     */
    void pushBack(void);

    /**
     * Specifies that matching pairs of this character delimit
     * string constants in this tokenizer.
     * @param ch Quote character.
     */
    void quoteChar(int ch);

    /**
     * Reset the tokenizer's syntax table so that all characters
     * are "ordinary". See the ordinaryChar method for more
     * information on characters being ordinary.
     */
    void resetSyntax(void);

    /**
     * Determines whether or not the tokenizer recognizes
     * C++-style comments.
     * @param flag True if slash-slash comments shall be skipped.
     */
    void slashSlashComments(bool flag);

    /**
     * Determines whether or not the tokenizer recognizes
     * C-style comments.
     * @param flag True if slash-star comments shall be skipped.
     */
    void slashStarComments(bool flag);

    /**
     * Specifies that all characters c in the range low <= c <= hi
     * are white space characters.
     * @param low Low range.
     * @param hi High range.
     */
    void whitespaceChars(int low, int hi);

    /**
     * Specifies that all characters c in the range low <= c <= hi
     * are word constituents.
     * @param low Low range.
     * @param hi High range.
     */
    void wordChars(int low, int hi);

    /**
     * Return token type.
     * @return Value of m_ttype.
     */
    inline int ttype(void) const { return m_ttype; }

    /**
     * Return token number value.
     * @return Value of m_nval.
     */
    inline double nval(void) const { return m_nval; }

    /**
     * Return token string value.
     * @return Value of m_sval.
     */
    inline const std::string& sval(void) const { return m_sval; }
};

#endif // _Tokenizer_H
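
The pushBack() contract documented in the header (the next call to nextToken() returns the same token again) is essentially a one-token lookahead. A minimal sketch of that idea follows, using a hypothetical WordReader that only splits on white space; it is an illustration of the flag mechanism and not a stand-in for the Tokenizer class itself.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical minimal reader, used only to illustrate the push-back flag.
class WordReader {
public:
    explicit WordReader(std::istream& in)
        : m_in(in), m_pushedBack(false), m_hasToken(false) {}

    // Read the next white-space separated word; returns false at end of input.
    // After pushBack(), the previous result is replayed instead of reading.
    bool next() {
        if (m_pushedBack) {
            m_pushedBack = false;
            return m_hasToken;
        }
        m_hasToken = static_cast<bool>(m_in >> m_word);
        return m_hasToken;
    }

    // Arrange for the next call to next() to return the current word again.
    void pushBack() { m_pushedBack = true; }

    const std::string& word() const { return m_word; }

private:
    std::istream& m_in;
    bool m_pushedBack;
    bool m_hasToken;
    std::string m_word;
};

// Look at the upcoming word without consuming it.
inline std::string peekWord(WordReader& r) {
    std::string w = r.next() ? r.word() : std::string();
    r.pushBack();
    return w;
}
```

A parser can use this to decide which rule applies before committing: peek at the word, and let the chosen rule read it again through the normal next() call.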

Currently browsing [tokenizer.zip] (3,234 bytes) - [tokenizer.cpp] - (6,541 bytes)

#include "tokenizer.h"

#include <ctype.h>   // tolower
#include <stdlib.h>  // atof

///////////////////////////////////////////////////////////////////////////////
Tokenizer::Tokenizer(std::istream& input):
    m_input(input)
{
    // set flags to default
    m_eolIsSignificant = false;
    m_lineNo = 0;
    m_lowerCaseMode = false;
    m_pushedBack = false;
    m_slslComments = true;
    m_slstComments = true;

    // no token has been read yet
    m_ttype = 0;
    m_nval = 0.0;

    // reset tables
    resetSyntax();
    wordChars('a', 'z');
    wordChars('A', 'Z');
    wordChars(128, 255);
    wordChars('_', '_');
    quoteChar('"');
    quoteChar('\'');
    whitespaceChars(0, ' ');
    parseNumbers();
}

///////////////////////////////////////////////////////////////////////////////
Tokenizer::~Tokenizer()
{
}

///////////////////////////////////////////////////////////////////////////////
void Tokenizer::commentChar(int ch)
{
    if(ch >= 0 && ch <= NCHAR)
    {
        m_chType[ch] = CT_COMMENT;
    }
}

///////////////////////////////////////////////////////////////////////////////
void Tokenizer::eolIsSignificant(bool flag)
{
    m_eolIsSignificant = flag;
}

///////////////////////////////////////////////////////////////////////////////
int Tokenizer::lineno(void)
{
    return m_lineNo;
}

///////////////////////////////////////////////////////////////////////////////
void Tokenizer::lowerCaseMode(bool flag)
{
    m_lowerCaseMode = flag;
}

///////////////////////////////////////////////////////////////////////////////
int Tokenizer::nextToken(void)
{
    // Characters outside the table: EOF (negative) counts as white space,
    // anything above NCHAR counts as a word character.
#define _CTYPE(c) (((c) < 0) ? CT_WHITESPACE : (((c) > NCHAR) ? CT_ALPHA : m_chType[c]))

    // check if token has been pushed back
    if(m_pushedBack == true)
    {
        m_pushedBack = false;
        return m_ttype;
    }

    // get character from input stream
    int ch = m_input.get();
    if(ch < 0)
        return (m_ttype = TT_EOF);
    int ctype = _CTYPE(ch);

    // strip possible white space
    while((ctype & CT_WHITESPACE) == CT_WHITESPACE)
    {
        if(ch == '\n')
        {
            ++m_lineNo;
            if(m_eolIsSignificant == true)
            {
                return (m_ttype = TT_EOL);
            }
        }
        ch = m_input.get();
        if(ch < 0)
            return (m_ttype = TT_EOF);
        ctype = _CTYPE(ch);
    }

    // parse number
    if((ctype & CT_DIGIT) == CT_DIGIT)
    {
        // if we got a minus sign we must also have a digit.
        if(ch == '-')
        {
            int pk = m_input.peek();
            if((_CTYPE(pk) & CT_DIGIT) == 0 || pk == '-')
                return (m_ttype = ch);
        }

        // get entire number
        std::string strnum = "";
        while((ctype & CT_DIGIT) == CT_DIGIT)
        {
            strnum += static_cast<char>(ch);
            int pk = m_input.peek();
            // stop at a non-digit, and at '-', which may only lead a number
            if(((_CTYPE(pk) & CT_DIGIT) == 0 && pk != '.') || pk == '-')
                break;
            ch = m_input.get();
            if(ch < 0)
                break;
            ctype = CT_DIGIT;
        }

        // return number token
        m_nval = atof(strnum.c_str());
        return (m_ttype = TT_NUMBER);
    }

    // parse word
    if((ctype & CT_ALPHA) == CT_ALPHA)
    {
        std::string strwrd = "";
        while((ctype & CT_ALPHA) == CT_ALPHA)
        {
            strwrd += static_cast<char>(ch);
            int pk = m_input.peek();
            if((_CTYPE(pk) & CT_ALPHA) == 0)
                break;
            ch = m_input.get();
            ctype = CT_ALPHA;
        }

        // force lower case word
        if(m_lowerCaseMode == true)
        {
            for(std::string::size_type i = 0; i < strwrd.length(); i++)
                strwrd[i] = static_cast<char>(tolower(static_cast<unsigned char>(strwrd[i])));
        }

        // return word token
        m_sval = strwrd;
        return (m_ttype = TT_WORD);
    }

    // parse single-line comment
    if((ctype & CT_COMMENT) == CT_COMMENT)
    {
        while((ch = m_input.get()) >= 0 && (ch != '\n' && ch != '\r'))
            ;
        // a consumed line feed still counts as a line
        if(ch == '\n')
            ++m_lineNo;
        return nextToken();
    }

    // parse quoted string
    if((ctype & CT_QUOTE) == CT_QUOTE)
    {
        int quote = ch;
        std::string strqte = "";
        while((ch = m_input.get()) >= 0 && (ch != quote && ch != '\n' && ch != '\r'))
        {
            strqte += static_cast<char>(ch);
        }
        // an unterminated string that ran into a line feed still counts the line
        if(ch == '\n')
            ++m_lineNo;

        // return quote token
        m_sval = strqte;
        return (m_ttype = TT_STRING);
    }

    // parse comments
    if(ch == '/' && (m_slslComments || m_slstComments))
    {
        int pk = m_input.peek();
        if(pk == '/' && m_slslComments == true)
        {
            // ignore characters until line break
            while((ch = m_input.get()) >= 0 && (ch != '\n' && ch != '\r'))
                ;
            // count the line only if a line feed was consumed; after a '\r'
            // the following '\n' is counted by the white space loop
            if(ch == '\n')
                ++m_lineNo;

            // return next token
            if(ch < 0)
                return (m_ttype = TT_EOF);
            return nextToken();
        }
        else if(pk == '*' && m_slstComments == true)
        {
            // consume the '*' first so that "/*/" does not close the comment
            m_input.get();
            int pch = 0;
            while((ch = m_input.get()) >= 0 && (ch != '/' || pch != '*'))
            {
                if(ch == '\n')
                    ++m_lineNo;
                pch = ch;
            }
            if(ch < 0)
                return (m_ttype = TT_EOF);
            return nextToken();
        }
    }

    // return character token
    return (m_ttype = ch);
}
#undef _CTYPE

///////////////////////////////////////////////////////////////////////////////
void Tokenizer::ordinaryChar(int ch)
{
    if(ch >= 0 && ch <= NCHAR)
    {
        m_chType[ch] = 0;
    }
}

///////////////////////////////////////////////////////////////////////////////
void Tokenizer::ordinaryChars(int low, int hi)
{
    if(low < 0) low = 0;
    if(hi > NCHAR) hi = NCHAR;
    while(low <= hi)
    {
        m_chType[low] = 0;
        ++low;
    }
}

///////////////////////////////////////////////////////////////////////////////
void Tokenizer::parseNumbers(void)
{
    for(int i = '0'; i <= '9'; i++)
    {
        m_chType[i] |= CT_DIGIT;
    }
    m_chType['-'] |= CT_DIGIT;
}

///////////////////////////////////////////////////////////////////////////////
void Tokenizer::pushBack(void)
{
    m_pushedBack = true;
}

///////////////////////////////////////////////////////////////////////////////
void Tokenizer::quoteChar(int ch)
{
    if(ch >= 0 && ch <= NCHAR)
    {
        m_chType[ch] |= CT_QUOTE;
    }
}

///////////////////////////////////////////////////////////////////////////////
void Tokenizer::resetSyntax(void)
{
    // the table has NCHAR + 1 entries, so the last index is NCHAR itself
    for(int i = 0; i <= NCHAR; i++)
    {
        m_chType[i] = 0;
    }
}

///////////////////////////////////////////////////////////////////////////////
void Tokenizer::slashSlashComments(bool flag)
{
    m_slslComments = flag;
}

///////////////////////////////////////////////////////////////////////////////
void Tokenizer::slashStarComments(bool flag)
{
    m_slstComments = flag;
}

///////////////////////////////////////////////////////////////////////////////
void Tokenizer::whitespaceChars(int low, int hi)
{
    if(low < 0) low = 0;
    if(hi > NCHAR) hi = NCHAR;
    while(low <= hi)
    {
        m_chType[low] |= CT_WHITESPACE;
        ++low;
    }
}

///////////////////////////////////////////////////////////////////////////////
void Tokenizer::wordChars(int low, int hi)
{
    if(low < 0) low = 0;
    if(hi > NCHAR) hi = NCHAR;
    while(low <= hi)
    {
        m_chType[low] |= CT_ALPHA;
        ++low;
    }
}
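
The slash-star handling inside nextToken() can be looked at in isolation. The sketch below is one way to write it, assuming the leading '/' has already been read and the '*' has only been peeked; skipBlockComment and restAfterComment are hypothetical helpers written for this example. It consumes the '*' before scanning, so that the three characters "/*/" are not mistaken for a complete comment, and it counts line feeds so a line counter stays in step.

```cpp
#include <cassert>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>

// Hypothetical helper: call after '/' was read and '*' was peeked.
// Consumes everything through the closing "*/" and counts line feeds.
// Returns false if the input ended before the comment was closed.
inline bool skipBlockComment(std::istream& in, int& lineNo) {
    const int eof = std::istream::traits_type::eof();
    in.get();  // consume the '*' that opened the comment
    int prev = 0;
    int ch;
    while ((ch = in.get()) != eof) {
        if (ch == '\n') ++lineNo;
        if (prev == '*' && ch == '/') return true;
        prev = ch;
    }
    return false;
}

// Hypothetical driver: text must start with "/*"; returns whatever follows
// the comment, or an empty string if the comment is never closed.
inline std::string restAfterComment(const std::string& text, int& lineNo) {
    std::istringstream in(text);
    in.get();  // the leading '/'
    if (!skipBlockComment(in, lineNo)) return std::string();
    return std::string((std::istreambuf_iterator<char>(in)),
                       std::istreambuf_iterator<char>());
}
```

Returning a flag for the unterminated case lets the caller decide whether to report end of input or an error with the current line number.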


 

Copyright 1999-2008 (C) FLIPCODE.COM and/or the original content author(s). All rights reserved.