Wednesday, June 27, 2012

Recognition of reserved words and identifiers


Recognizing keywords and identifiers presents a problem. Usually, keywords like if or then are reserved, so they are not identifiers even though they look like identifiers. Thus, although we typically use a transition diagram like the one in the figure to search for identifier lexemes, this diagram will also recognize the keywords if, then, and else of our running example.


The accepting state of the identifier diagram executes the action return (getToken(), installID()).
There are two ways that we can handle reserved words that look like identifiers.
      1. Install the reserved words in the symbol table initially. A field of the symbol-table entry indicates that these strings are never ordinary identifiers, and tells which token they represent. When we find an identifier, a call to installID places it in the symbol table if it is not already there and returns a pointer to the symbol-table entry for the lexeme found. Of course, any identifier not in the symbol table during lexical analysis cannot be a reserved word, so its token is id. (A sketch of this approach appears after this list.)
      2. Create a separate transition diagram for each keyword; an example for the keyword then is shown in the figure. Note that such a transition diagram consists of states representing the situation after each successive letter of the keyword is seen, followed by a test for a "nonletter-or-digit", i.e., any character that cannot be the continuation of an identifier.
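The following is a minimal sketch of approach 1 in C. The names installID and getToken, the fixed-size table, and the token codes are illustrative assumptions rather than a fixed interface; the point is only that the keywords are placed in the symbol table before scanning starts, so a later lookup of "if" yields the keyword token rather than id.

    /* Sketch of approach 1: keywords are pre-installed in the symbol table,
       so installID() resolves reserved words and ordinary identifiers alike.
       installID, getToken, and the token codes are illustrative assumptions. */
    #include <stdio.h>
    #include <string.h>

    enum token { ID, IF, THEN, ELSE };
    static const char *tokenName[] = { "id", "if", "then", "else" };

    struct entry { char lexeme[32]; enum token tok; };
    static struct entry symtable[256];
    static int nsym = 0;

    /* Install a lexeme if it is not already present; return its position. */
    static int installID(const char *lexeme, enum token tok) {
        for (int i = 0; i < nsym; i++)
            if (strcmp(symtable[i].lexeme, lexeme) == 0)
                return i;                          /* already there: keep its token */
        strcpy(symtable[nsym].lexeme, lexeme);
        symtable[nsym].tok = tok;
        return nsym++;
    }

    /* The token stored in the entry: a keyword code if pre-installed, else id. */
    static enum token getToken(int pos) { return symtable[pos].tok; }

    int main(void) {
        /* Reserve the keywords before lexical analysis begins. */
        installID("if", IF);
        installID("then", THEN);
        installID("else", ELSE);

        /* What the lexer would do on finding identifier-shaped lexemes. */
        const char *lexemes[] = { "if", "count", "then", "count" };
        for (int i = 0; i < 4; i++) {
            int pos = installID(lexemes[i], ID);   /* ID is used only for new entries */
            printf("%-6s -> token %s, table position %d\n",
                   lexemes[i], tokenName[getToken(pos)], pos);
        }
        return 0;
    }

Because "if" was installed before scanning began, looking it up returns the keyword token, while "count" is entered once and recognized as an ordinary id on every later occurrence.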

Transition diagram for recognition of tokens


Recognition of tokens:                                                               
We have learned how to express patterns using regular expressions. Now, we must study how to take the patterns for all the needed tokens and build a piece of code that examines the input string and finds a prefix that is a lexeme matching one of the patterns.

stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | number

For relop, we use the comparison operators of languages like Pascal or SQL, where = is "equals" and <> is "not equals", because it presents an interesting structure of lexemes.
The terminals of the grammar, which are if, then, else, relop, id, and number, are the names of tokens as far as the lexical analyzer is concerned. The patterns for these tokens are described using the following regular definitions:

digit  → [0-9]
digits → digit+
number → digits (. digits)? (E [+-]? digits)?
letter → [A-Za-z]
id     → letter (letter | digit)*
if     → if
then   → then
else   → else
relop  → < | > | <= | >= | = | <>
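As a rough illustration (not the book's implementation), the definitions above can be composed the same way they are written: each named definition becomes a string, and later definitions splice in earlier ones. POSIX extended regular expressions are assumed here as a stand-in for the book's notation, and the matches() helper is invented for the sketch.

    /* Regular definitions composed by name, then tested against sample lexemes.
       POSIX ERE syntax is assumed; the macro names mirror the definitions above. */
    #include <stdio.h>
    #include <regex.h>

    #define DIGIT   "[0-9]"
    #define DIGITS  DIGIT "+"
    #define NUMBER  DIGITS "(\\." DIGITS ")?(E[+-]?" DIGITS ")?"
    #define LETTER  "[A-Za-z]"
    #define ID      LETTER "(" LETTER "|" DIGIT ")*"
    #define RELOP   "<=|>=|<>|<|>|="

    /* Return 1 if the whole lexeme matches the pattern, 0 otherwise. */
    static int matches(const char *pattern, const char *lexeme) {
        char anchored[256];
        regex_t re;
        snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
        if (regcomp(&re, anchored, REG_EXTENDED) != 0) return 0;
        int ok = regexec(&re, lexeme, 0, NULL, 0) == 0;
        regfree(&re);
        return ok;
    }

    int main(void) {
        const char *lexemes[] = { "count2", "3.14E+5", "<=", "2abc" };
        for (int i = 0; i < 4; i++)
            printf("%-8s id:%d  number:%d  relop:%d\n", lexemes[i],
                   matches(ID, lexemes[i]),
                   matches(NUMBER, lexemes[i]),
                   matches(RELOP, lexemes[i]));
        return 0;
    }

For example, "count2" matches only id, "3.14E+5" matches only number, and "2abc" matches none of the patterns.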

In addition, we assign the lexical analyzer the job of stripping out white space, by recognizing the "token" ws defined by:
            ws → (blank | tab | newline)+
Here, blank, tab, and newline are abstract symbols that we use to express the ASCII characters of the same names. Token ws is different from the other tokens in that, when we recognize it, we do not return it to the parser, but rather restart the lexical analysis from the character that follows the white space. It is the following token that gets returned to the parser.
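A minimal sketch of this convention, assuming a hypothetical scan() routine that recognizes ws alongside the real tokens: the routine the parser actually calls discards any ws match and restarts scanning, so the parser only ever sees the token that follows the white space.

    /* Sketch: ws is recognized like any other token but never returned.
       scan(), nextToken(), and the token set here are illustrative only. */
    #include <stdio.h>
    #include <ctype.h>

    enum name { WS, ID, NUM, END };
    static const char *names[] = { "ws", "id", "number", "eof" };

    struct Token { enum name name; const char *start; int len; };

    static const char *forward;                        /* scanning pointer */

    static struct Token scan(void) {
        const char *begin = forward;
        if (*forward == '\0') return (struct Token){ END, begin, 0 };
        if (isspace((unsigned char)*forward)) {        /* ws -> (blank|tab|newline)+ */
            while (isspace((unsigned char)*forward)) forward++;
            return (struct Token){ WS, begin, (int)(forward - begin) };
        }
        if (isalpha((unsigned char)*forward)) {        /* id */
            while (isalnum((unsigned char)*forward)) forward++;
            return (struct Token){ ID, begin, (int)(forward - begin) };
        }
        forward++;                                     /* number (crudely), or anything else */
        while (isdigit((unsigned char)*forward)) forward++;
        return (struct Token){ NUM, begin, (int)(forward - begin) };
    }

    /* What the parser calls: ws is skipped, the following token is returned. */
    static struct Token nextToken(void) {
        struct Token t;
        do { t = scan(); } while (t.name == WS);
        return t;
    }

    int main(void) {
        forward = "  if\tcount \n 42";
        for (struct Token t = nextToken(); t.name != END; t = nextToken())
            printf("%-6s \"%.*s\"\n", names[t.name], t.len, t.start);
        return 0;
    }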

    Lexeme        Token Name    Attribute Value
    Any ws        -             -
    if            if            -
    then          then          -
    else          else          -
    Any id        id            Pointer to table entry
    Any number    number        Pointer to table entry
    <             relop         LT
    <=            relop         LE
    =             relop         EQ
    <>            relop         NE
    >             relop         GT
    >=            relop         GE

 
Transition Diagram:
            A transition diagram has a collection of nodes or circles, called states. Each state represents a condition that could occur during the process of scanning the input looking for a lexeme that matches one of several patterns.
   Edges are directed from one state of the transition diagram to another. Each edge is labeled by a symbol or a set of symbols.
If we are in some state s, and the next input symbol is a, we look for an edge out of state s labeled by a. If we find such an edge, we advance the forward pointer and enter the state of the transition diagram to which that edge leads.
Some important conventions about transition diagrams are:
      1. Certain states are said to be accepting, or final. These states indicate that a lexeme has been found, although the actual lexeme may not consist of all positions between the lexemeBegin and forward pointers. We always indicate an accepting state by a double circle.
      2. In addition, if it is necessary to retract the forward pointer one position, then we shall additionally place a * near that accepting state.
      3. One state is designated the start state, or initial state; it is indicated by an edge, labeled "start", entering from nowhere. The transition diagram always begins in the start state before any input symbols have been used.
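The relop pattern makes a convenient example of these conventions. The sketch below encodes its transition diagram as a switch on the current state, in C; the state numbers, the retract() helper, and the getRelop() interface are illustrative assumptions, and retract() corresponds to the * convention in point 2, undoing the one-character lookahead that took us to the accepting state.

    /* The relop transition diagram as a switch-based state machine.
       State numbers and helper names are illustrative assumptions. */
    #include <stdio.h>

    enum relop { LT, LE, EQ, NE, GT, GE, NONE };
    static const char *names[] = { "LT", "LE", "EQ", "NE", "GT", "GE", "none" };

    static const char *forward;                        /* the forward pointer */
    static char nextChar(void) { return *forward++; }
    static void retract(void)  { forward--; }          /* the '*' on an accepting state */

    static enum relop getRelop(const char *lexemeBegin) {
        forward = lexemeBegin;
        int state = 0;
        for (;;) {
            char c = nextChar();
            switch (state) {
            case 0:                                    /* start state */
                if (c == '<')      state = 1;
                else if (c == '=') return EQ;          /* accepting state */
                else if (c == '>') state = 6;
                else               return NONE;        /* not a relop at all */
                break;
            case 1:                                    /* have seen '<' */
                if (c == '=') return LE;               /* accepting */
                if (c == '>') return NE;               /* accepting */
                retract();                             /* accepting, with '*' */
                return LT;
            case 6:                                    /* have seen '>' */
                if (c == '=') return GE;               /* accepting */
                retract();                             /* accepting, with '*' */
                return GT;
            }
        }
    }

    int main(void) {
        const char *tests[] = { "<", "<=", "<>", "=", ">", ">=x" };
        for (int i = 0; i < 6; i++)
            printf("%-4s -> %s\n", tests[i], names[getRelop(tests[i])]);
        return 0;
    }

On input ">=x", for example, the machine accepts ">=" and leaves the forward pointer on the 'x', which will begin the next lexeme; on "<" followed by anything other than '=' or '>', it retracts and reports LT.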


Regular Definition


We may wish to give names to certain regular expressions and to define other regular expressions using these names as if they were symbols.
If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form
d1 → r1
d2 → r2
.......
dn → rn
where each di is a distinct name, and each ri is a regular expression over the symbols in Σ ∪ {d1, d2, ..., di-1}, i.e., the basic symbols and the previously defined names. By restricting each ri to Σ and the previously defined names, we can construct a regular expression over Σ for any ri by repeatedly replacing regular-expression names by the expressions they denote. If ri could use dj for some j >= i, then ri might be recursively defined, and this substitution process would not terminate.
To distinguish names from symbols, we print the names in regular definitions in bold face.
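As a rough illustration of this substitution process (not part of the original text), the sketch below expands the definitions for number by plain textual replacement. The fixed-size buffers and the substitute() helper are assumptions of the sketch; substituting the most recently defined names first keeps a name such as digit from being replaced inside an occurrence of digits.

    /* Expanding regular-definition names in order, as described above.
       The buffers and the substitute() helper are illustrative only. */
    #include <stdio.h>
    #include <string.h>

    struct def { const char *name; char expr[256]; };

    /* Replace every occurrence of name in expr with (body). */
    static void substitute(char *expr, const char *name, const char *body) {
        char out[256] = "";
        size_t n = strlen(name);
        for (const char *p = expr; *p; ) {
            if (strncmp(p, name, n) == 0) {
                strcat(out, "(");
                strcat(out, body);
                strcat(out, ")");
                p += n;
            } else {
                strncat(out, p, 1);
                p++;
            }
        }
        strcpy(expr, out);
    }

    int main(void) {
        struct def defs[] = {             /* d1, d2, d3 in definition order */
            { "digit",  "[0-9]" },
            { "digits", "digit+" },
            { "number", "digits(.digits)?(E[+-]?digits)?" },
        };
        /* Each ri may mention only d1..di-1, so expanding in order terminates
           and leaves expressions over the basic symbols alone.  We substitute
           the latest names first so that "digit" is not replaced inside an
           occurrence of "digits". */
        for (int i = 1; i < 3; i++)
            for (int j = i - 1; j >= 0; j--)
                substitute(defs[i].expr, defs[j].name, defs[j].expr);
        for (int i = 0; i < 3; i++)
            printf("%-7s -> %s\n", defs[i].name, defs[i].expr);
        return 0;
    }

Running this prints digits as ([0-9])+ and number as an expression containing only the basic symbols, which is exactly the result the repeated-replacement argument above guarantees.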