howtouse.txt


Using rubylexer:
require "rubylexer.rb"
  #then later
lexer=RubyLexer.new(a_file_name, opened_File_or_String)
until EoiToken===(token=lexer.get1token)
  #...do stuff w/ token...
end

For a slightly expanded version of this example, see test/code/dumptokens.rb.

tok will be a subclass of Token. there are many token classes (see token.rb)
however, all tokens have some common methods:
to_s   #return a string containing ruby code representing that token
ident  #return internal form of token; use with caution
offset #offset in file of start of token
error  #returns a string if there was a lex error at this position, else nil

here's a list of token subclasses and their meaning:
(note: indentation indicates inheiritance)

WToken       #(mostly useless?) abstract superclass for KeywordToken, 
             #OperatorToken, VarNameToken, and HerePlaceholderToken
             #but not (confusingly) MethNameToken (perhaps that'll change)
 KeywordToken #a ruby keyword or non-overridable punctuation char(s)
 OperatorToken #overrideable operators. 
               #use #unary? and #binary? to find out how many arguments it takes.
 VarNameToken #a name that represents a variable
 HerePlaceholderToken #represents the header of a here string. subclass of WToken
MethNameToken  #the name of a method: the uncoloned
                 #symbols allowed in 'alias' and 'undef' statements and all names
                 #which follow a 'def',
                 #'::', or '.', as well as other call sites. operators used as
                 #method names will appear as methnametokens.
                 #confusingly, this is not a WToken.
NumberToken  #a literal number, including character constants
SymbolToken  #a symbol
NewlineToken #represents an (unescaped) newline.
StringToken  #represents a string. unlike all other tokens, strings might contain
             #other tokens. if the string used interpolation, tokens inside #{ }
             #are considered subtokens of the string. StringToken#elems returns
             #an array whose elements are sections of uninterpolated string (in
             #the even indeces) and arrays of subtokens (in the odd indeces).
             #this notion of subtokens is an unfortunate one and will go away in
             #a future release.
 RenderExactlyStringToken  #a subclass of StringToken; used to represent regexes and other string-like thingys

ErrorToken   #actually a module that may be mixed in to any token. indicates an error in the input at (or
             #near) that position. You may continue getting tokens after an error token is encountered, 
             #and I try to make this work as well as possible, but I can not guarantee correctness after
             #an error.
             #note: any token may be an ErrorToken, including IgnoreToken, EoiToken, a subtoken of a 
             #StringToken, etc. Please take this into account in your error processing.

IgnoreToken  #superclass for tokens without semantic meaning to a parser
 WsToken      #whitespace
 EscNlToken   #implicitly or explicitly escaped newline
 EoiToken     #end of source file. always the last token
 HereBodyToken        #the actual body of the here string. subclass of IgnoreToken
  OutlinedHereBodyToken #hacky subclass of HereBodyToken... will disappear once strings are done right.

 ZwToken      #informational IgnoreTokens. (parsers might need to look at some of these, actually.)
  NoWsToken    #no whitespace was on either side of this token. kind of a hack
               #to help TokenPrinter work correctly in certain cases.

  ImplicitParamListStartToken #if you leave the parentheses out in a function
  ImplicitParamListEndToken   #call, a pair of these will be generated instead

  KwParamListStartToken       #the when,for,and rescue keywords take a comma-
  KwParamListEndToken         #delimited list. these tokens enclose those lists.

  AssignmentRhsListStartToken #encloses the right hand side of an assignment,
  AssignmentRhsListEndToken   #including both single and multiple assignment.

  FileAndLineToken   #generated at every newline, escaped or unescaped. the file
                     #and line methods of this class return the file and line at
                     #that point in the token stream. (not always working right now.)


Subclasses of WToken provide an === method for comparing the token to a String or Regexp.

The different types of string and how to distinguish: 
For the most part you can tell what was what by looking at StringToken#char. 
Single and double quotes can't be distinguished this way, and neither can you tell a fancy 
string (starting with %) from the regular kind. If you want to have more... one option 
with the current code is to just go and look in your input what was at StringToken#offset. 

Eventually, (in version 0.8) string boundaries, bodies, and inclusions will all be separate 
tokens laid out linearly in the token stream. This is the way matz handles things, and it's 
much cleaner.  I'll make sure that the string start token at that time contains all the 
info you could want. If you really can't wait for 0.8 and can't stand #offset, I can hack 
in a method to StringToken that tell you exactly what char(s) opened the string.

Certain keywords (if, unless, while, until, do) may or may not have an associated end keyword.
For instance, these have ends associated:

if somthing then
  do_somthing
end

a.each do|x| 
  x.something_about_it 
end

And these do not:

do_something if something

for x in a do 
  x.something_about_it 
end #paired to the for, not the do!

A KeywordToken is generated by rubylexer in either case, but you can now use the has_end?
method of KeywordToken to determine whether an end should be expected for a particular if or not. 

api stability:
Future changes to the user-visible api will happen in a backwards-compatible way, so that
if the interface changes, there will be a (probably quite long) transition period during 
which both the old and new interfaces are supported. The idea is to give users plenty of 
time to adapt to changes. That promise goes for all the changes described below.

In cases where the 2 are incompatible, (inspired by rubygems) I've come up with this:

  require 'rubylexer/0.6'
  rl=RubyLexer.new(...args...)  #request the 0.6 api
  
This actually works currently; it enables the old api where errors cause an exception instead
of generating ErrorTokens. The default will always be to use the new api.

StringToken will go away; replaced by multiple token types, like in ruby.  StringToken
subclasses will need reorganization at this point too... tokens in an interpolation
will no longer be 'subtokens' but full-fledged tokens in their own right.
i intend to make a namespace for all rubylexer classes at some point... shouldn't
be a big deal; old clients can just include the namespace module.
Token#ident may be taken away or change without notice.
MethNameToken may become a WToken
HereBodyToken should really be a string subclass...
Newline,EscNl,BareSymbolToken may get renamed