How to make a parser with the lex tool

Have you ever needed a parser to determine the correctness of a text or file? Have you ever wondered how to build a compiler for your programming language or framework? The lex and yacc tools can greatly assist in this.

Lex and yacc are two paired tools that you can use to create a text parser. Lex divides the input stream into tokens (words), while yacc takes the tokens and assembles them into high-level constructs, analogous to sentences. Yacc is designed to work with the output of lex, though with additional code this can be bypassed; conversely, the lex output stream is designed so that it can also serve as the input stream to another parser.

Their purpose is to read files that have an understandable, structured format. For example, the code of most programming languages can be read using lex and yacc, since most such files have a predictable format. Lex and yacc are suited to simple, regular grammars: natural languages are still out of their reach, but most computer programming languages are within it.

In addition to lex and yacc, you will often encounter the names flex and bison. Lex and yacc are part of the standard BSD Unix distribution; flex is the GNU alternative to lex, and bison is the GNU alternative to yacc.

Below are some simple examples of how to use lex. The text is intended for developers, and no prior knowledge of automata theory is necessary.

Installation of lex and yacc

If you want to install lex and yacc on a Debian / Ubuntu system, you can do this by using the following command in the terminal:

$ sudo apt-get install byacc flex

If you are using Windows, you will need to download the flex and bison installers from http://gnuwin32.sourceforge.net/packages.html and install them. After that, in System Properties -> Environment Variables -> Path, add the path to the folder where flex and bison are installed.

Running lex and yacc programs

A lex program is written in a file with the extension .l. To obtain a lexical analyzer, the following command must be executed:

$ lex <program-name>.l

In this way, we generated a lexical analyzer, which is located in the file lex.yy.c.

On the other hand, a yacc program is written in a file with the extension .y. To get a syntax parser, you need to execute the following command:

$ yacc -d <program-name>.y

This will generate the syntax parser, which is contained in the y.tab.c file and the header file y.tab.h. To get the executable file, you need to run the following command:

$ gcc -o <program-name> lex.yy.c y.tab.c -lfl

If we only need the executable file of a lexical analyzer, we can do this as follows:

$ gcc -o <program-name> lex.yy.c -lfl

Lex: lexical analyzer generator

A lexical analyzer is a program that divides an input stream into identifiable parts. Lex accepts a problem-oriented specification for matching an input stream against rules and produces a general-purpose program that recognizes regular expressions. The regular expressions are specified by the user in the lex source specification.

Lex then converts the regular expressions and actions into a generated function called yylex(). The yylex() function recognizes the regular expressions in the input stream and performs the specified action for each rule that matches.

For example, a simple lexical analyzer might count the lines and words in its input stream:

%{
#include <stdio.h>
int numberLines = 0, numberRec = 0;
%}
%%
\n          ++numberLines;
[A-Za-z]+   ++numberRec;
.           ;
%%
int main() {
    yylex();
    printf("number of lines = %d, number of words = %d\n", numberLines, numberRec);
    return 0;
}
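
If the program above is saved in a file named count.l (the file and input names here are just examples), it can be built and run with the commands described in the previous section:

$ lex count.l
$ gcc -o count lex.yy.c -lfl
$ ./count < sometext.txt

When the input ends, yylex() returns and the printf in main reports the two counters.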

Programming lex programs

A lex program consists of three parts:

{definitions}
%%
{rules}
%%
{user subroutines}

Definitions are placed before the first %%. Inside the definition section, C code is enclosed between %{ and %}; this block can contain declarations of local and global variables, include directives, and macro definitions. Anything that is not between %{ and %} lex tries to interpret as the declaration of a named regular expression: on the left is the name, and on the right is the regular expression it stands for. The name must begin with a letter, followed by one or more letters, digits, underscores, or dashes. The subroutines section contains the user's own routines, which can read characters from a file or standard input, print to a file or standard output, or override some of lex's functions.
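
For illustration, here is a sketch of a definition section together with two rules that use the named expressions; the names DIGIT, WORD, and counter are invented for this example. In the rules section, a named expression is referenced by writing its name in braces:

%{
#include <stdio.h>
int counter = 0;    /* global variable visible to the rules below */
%}
DIGIT    [0-9]
WORD     [A-Za-z]+
%%
{DIGIT}+    { counter++; printf("number: %s\n", yytext); }
{WORD}      { counter++; printf("word: %s\n", yytext); }
.|\n        ;
%%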

The first %% marks the beginning of the rules section, while the second %% is optional. The absolute minimum lex program is just %%, in which case everything from the input stream is simply copied to the output stream.
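
That minimal program really is a complete lex source file: a file containing only the line below, when processed with lex and linked with -lfl (which supplies a default main), produces a scanner that copies its input to its output unchanged.

%%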

The rules represent the user's control decisions. They are written in tabular form: the left column contains regular expressions, while the right column contains actions, i.e. the fragments of code that execute when an expression is found in the input stream. A space or tab marks the end of a regular expression. The regular expression:

[A-Za-z]+   ++numberRec;

searches for one or more consecutive letters in the input stream and increments the variable numberRec whenever the rule matches. If an action consists of a single C statement, it can be placed directly on the right side of the rule; if more than one statement is needed, they should be enclosed in braces:

[A-Za-z]+   {
    printf("Found word");
    numberRec++;
}

The definition of regular expressions is very similar to the regular expressions of the QED editor. A regular expression specifies a set of strings to match. It contains text characters and operator characters. Letters of the alphabet and digits are always text characters. The operator characters are:

" \ [ ] ^ - ? . * + | ( ) $ / { } % < >

and if they are to be used as text characters, they must be preceded by a backslash (\). Anything enclosed between quotation marks (") is taken as a string of text characters. The most common use of quotation marks is to include a blank in an expression.
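
For example (the messages here are invented for the sketch), an operator character can be matched either by escaping it with \ or by enclosing it in quotation marks, and quotation marks also let a blank become part of an expression:

\+              printf("found a plus sign\n");
"++"            printf("found the ++ operator\n");
"end of file"   printf("found the phrase, blanks included\n");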

Character classes are specified with square brackets []. The construction [abc] matches a single character, which can be a, b, or c. Inside the brackets, most operators behave as text characters; only three remain special: \, -, and ^. The character - indicates a range. For example:

[A-Za-z]

represents a character class containing lowercase and uppercase letters.
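
A few more character classes, with their meaning written alongside (the descriptions on the right are explanatory notes, not lex actions):

[0-9]       any single digit
[^0-9]      any single character that is not a digit (^ at the start negates the class)
[-+0-9]     a digit, a plus, or a minus (a dash listed first or last is a literal character)
[a-z\-]     a lowercase letter or a dash (here the dash is escaped with \)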

When a regular expression matches, lex executes the specified action. The default action, copying the input to the output, is performed on all strings that do not match any of the given regular expressions. The simplest explicit action is to ignore the input. Example:

[ \t\n]   ;

which will ignore blanks, tabs, and newlines. If one rule performs the same action as the next one, the | character can be used so that the action does not have to be repeated. For example:

\n   |
.    f();

If the first rule matches, the same action as for the rule below it is performed. In most cases, the user wants to know what text actually matched a regular expression; this can be achieved as follows:

[A-Za-z]+   printf("%s", yytext);

For each word in the input stream, lex will print the matched text, which is stored in yytext, to the output.

Lex knows how to handle multiple matches. When more than one regular expression matches, lex selects a rule as follows:

The rule that matches the longest string of characters wins.
Among rules that match the same number of characters, the one specified first wins.

Now let's define the following rules:

alenblog   letters();
[a-z]+     reg();

If the input is "alenblogs", the first rule matches "alenblog" (8 characters) and the second rule matches the whole word (9 characters), so the action reg() will be executed. If the input is "alenblog", both rules match 8 characters, but the first rule wins and the function letters() will be executed.

Sometimes it is useful for every rule that matches the input stream to perform its action. This can be done with the REJECT action, which tells lex to go on and execute the next alternative rule as well. For example, suppose we have the following rules:

imi        i++;
alenblog   ib++;
\n   |
.    ;

If the input contains "alenblog", the second rule will prevail because it matches the most characters: the variable ib will increase, but the variable i will not. In the following way we would allow the first rule to match as well whenever possible:

imi        i++;
alenblog   {ib++; REJECT;}
\n   |
.    ;

Conclusion

In this article, I explained what lex and yacc are, how they can be installed on different operating systems, and how to write and run lex programs. To reiterate: lex divides the input stream into parts using the given regular expressions, while yacc uses those parts to recognize high-level constructs.

If you want to get more familiar with programming in lex and yacc, the complete tutorial can be found at the following link http://dinosaur.compilertools.net/.

Links
http://dinosaur.compilertools.net/
https://www.ibm.com/developerworks/library/l-lexyac/index.html
http://epaperpress.com/lexandyacc/pry2.html
