Being a programmer requires working knowledge of many subjects, and compiler design is one of the most underrated among them. Knowing what goes on behind the scenes when a program is compiled helps you understand where errors and warnings come from, which shortens the time it takes to get a program running correctly. Today, we’re going to look at the first step a compiler performs when it translates a program: lexical analysis. We will see what a lexical analyzer does, and we have included a simple lexical analyzer program in C++ to show how this part of a compiler works behind the scenes.
What is a Compiler?
Before moving to lexical analysis in C++, we need to have a basic understanding of a compiler.
Most of the programs we type in are written in high-level languages whose keywords resemble English (or some other human language). A computer cannot understand these directly; its processor executes only instructions encoded as a stream of 1’s and 0’s. In other words, binary is the only language comprehensible to a computer. Hence, it is necessary to convert these “high-level” language programs to binary before the computer can carry out the instructions. That is exactly what a compiler and an interpreter are used for. They make our programs comprehensible to the computer.
Although they serve the same purpose, compilers and interpreters work differently. A compiler translates the entire program to binary code at once, before any of it runs. If there is a compile-time error in any part of the code, no executable is produced and the program won’t give any output. An interpreter, however, translates and executes the program statement by statement. This means that any output produced before the faulty statement still appears. Because the translation is done once, ahead of time, a compiled program typically runs faster than an interpreted one.
Most programming languages use either a compiler or an interpreter under the hood for conversion to binary code (also called “machine language”). C++ uses a compiler, while Python is usually run by an interpreter. Some languages like Java use both: the source is first compiled to bytecode, which is then executed by a virtual machine.
What is Lexical Analysis?
Now, let’s understand lexical analysis in programming languages like C++. Compilation is spread across several stages; a compiler does not convert a high-level program into binary in a single step. The first stage is called lexical analysis. During this stage, the program typed by the user is broken into pieces, and every token that is a part of it is extracted and stored separately (tokens are the smallest meaningful units of a program). These tokens need to be classified into particular types before the later stages of compilation can begin.
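To make the splitting step concrete, here is a minimal sketch that breaks one statement into its raw pieces (often called lexemes). The function name splitIntoLexemes is hypothetical, and the sketch assumes that every symbol is a single character and that whitespace only separates tokens; a real compiler handles many more cases.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: split a statement into lexemes (raw token strings).
// Assumption: every symbol is one character; whitespace only separates tokens.
std::vector<std::string> splitIntoLexemes(const std::string& source)
{
    std::vector<std::string> lexemes;
    std::string current;                         // word being accumulated

    for (char ch : source) {
        unsigned char uch = static_cast<unsigned char>(ch);
        if (std::isalnum(uch) || ch == '_') {
            current += ch;                       // still inside a word or number
            continue;
        }
        if (!current.empty()) {                  // a word has just ended
            lexemes.push_back(current);
            current.clear();
        }
        if (!std::isspace(uch)) {
            lexemes.push_back(std::string(1, ch));  // a symbol is its own lexeme
        }
    }
    if (!current.empty()) {
        lexemes.push_back(current);              // flush the final word
    }
    return lexemes;
}
```

Calling splitIntoLexemes("int m = n + 3;") yields the seven pieces int, m, =, n, +, 3 and ;. Nothing has been classified yet; that is the next step.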
There are several types of tokens associated with any language. The names given by the user to parts of the program, such as functions and variables, are called identifiers. They are called as such because they “identify” a named entity, such as a storage location in memory. Then come keywords: words reserved by the language for some of its functionality (in C++, these include words like if, else, for, break, continue, and so on; names like cout and cin are not keywords but identifiers declared by the standard library). Punctuators are used for the construction of expressions and statements. They are useful only when used in conjunction with identifiers or keywords in a statement. Operators are used for performing actual operations on the data (like arithmetic, logical, and shift operations). Literals are constant data that programs need to deal with, like numbers, characters, or strings.
What does a lexical analyzer do? Its main responsibility is to separate a program into its tokens and to classify those tokens.
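The token categories described above can be sketched as an enumeration plus a small classifier. The names TokenType and classify are hypothetical, the keyword list is deliberately short, and only unsigned integer literals are recognized. The split between operators and punctuators is an assumption made for this illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_set>

// Hypothetical token categories, following the types described in the text.
enum class TokenType { Keyword, Identifier, Literal, Operator, Punctuator, Unknown };

// Hypothetical classifier for a single, already separated lexeme.
TokenType classify(const std::string& lexeme)
{
    // Deliberately small sample sets; a real language defines many more.
    static const std::unordered_set<std::string> keywords = {
        "if", "else", "for", "while", "break", "continue", "return", "int", "double"
    };
    static const std::string operators = "+-*/<>=&|";
    static const std::string punctuators = ",;()[]{}";

    if (lexeme.empty()) return TokenType::Unknown;
    if (keywords.count(lexeme) > 0) return TokenType::Keyword;
    if (lexeme.size() == 1) {
        if (operators.find(lexeme[0]) != std::string::npos) return TokenType::Operator;
        if (punctuators.find(lexeme[0]) != std::string::npos) return TokenType::Punctuator;
    }

    auto isDigit = [](unsigned char c) { return std::isdigit(c) != 0; };
    auto isWordChar = [](unsigned char c) { return std::isalnum(c) != 0 || c == '_'; };

    // Only unsigned integer literals are handled in this sketch.
    if (std::all_of(lexeme.begin(), lexeme.end(), isDigit)) return TokenType::Literal;

    // An identifier starts with a letter or underscore, followed by word characters.
    unsigned char first = static_cast<unsigned char>(lexeme[0]);
    if ((std::isalpha(first) != 0 || first == '_') &&
        std::all_of(lexeme.begin(), lexeme.end(), isWordChar)) {
        return TokenType::Identifier;
    }
    return TokenType::Unknown;
}
```

With these assumptions, classify("if") gives Keyword, classify("count") gives Identifier, classify("42") gives Literal, and classify("3p") gives Unknown because an identifier cannot start with a digit.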
C++ Program for Lexical Analyzer
The following is a simple lexical analyzer program in C++:
#include <cctype>
#include <cstdlib>
#include <cstring>
#include <iostream>
using namespace std;

// Check whether the given character is a punctuator (any character that ends a word).
bool isPunctuator(char ch)
{
    // strchr also matches the terminating '\0', so exclude it explicitly.
    return ch != '\0' && strchr(" +-*/,;><=()[]{}&|", ch) != nullptr;
}

// Check whether the given string is a valid identifier.
bool validIdentifier(const char* str)
{
    // An identifier cannot start with a digit or a special character.
    if (isdigit(static_cast<unsigned char>(str[0])) || isPunctuator(str[0]))
        return false;
    // The remaining characters cannot be special characters either.
    int len = static_cast<int>(strlen(str));
    for (int i = 1; i < len; i++)
        if (isPunctuator(str[i]))
            return false;
    return true;
}

// Check whether the given character is an operator.
bool isOperator(char ch)
{
    return ch != '\0' && strchr("+-*/><=|&", ch) != nullptr;
}

// Check whether the given string is a keyword.
bool isKeyword(const char* str)
{
    static const char* const keywords[] = {
        "if", "else", "while", "do", "break", "continue", "int", "double",
        "float", "return", "char", "case", "long", "short", "typedef",
        "switch", "unsigned", "void", "static", "struct", "sizeof",
        "volatile", "enum", "const", "union", "extern", "bool"
    };
    for (const char* keyword : keywords)
        if (strcmp(str, keyword) == 0)
            return true;
    return false;
}

// Check whether the given string is a number:
// one or more digits with at most one decimal point.
bool isNumber(const char* str)
{
    int len = static_cast<int>(strlen(str));
    int numOfDecimal = 0, numOfDigits = 0;
    for (int i = 0; i < len; i++)
    {
        if (str[i] == '.')
        {
            if (++numOfDecimal > 1)
                return false;
        }
        else if (isdigit(static_cast<unsigned char>(str[i])))
            numOfDigits++;
        else
            return false;
    }
    return numOfDigits > 0;
}

// Extract the substring realStr[l..r]; the caller must free() the result.
char* subString(const char* realStr, int l, int r)
{
    char* str = (char*) malloc(sizeof(char) * (r - l + 2));
    for (int i = l; i <= r; i++)
        str[i - l] = realStr[i];
    str[r - l + 1] = '\0';
    return str;
}

// Split the expression into tokens and print the type of each one.
void parse(const char* str)
{
    int left = 0, right = 0;
    int len = static_cast<int>(strlen(str));
    while (right <= len)
    {
        if (right < len && !isPunctuator(str[right]))
        {
            right++; // still inside a word: a run of letters or digits
        }
        else if (left != right)
        {
            // A word just ended: classify it as a keyword, number, or identifier.
            char* sub = subString(str, left, right - 1);
            if (isKeyword(sub))
                cout << sub << " IS A KEYWORD\n";
            else if (isNumber(sub))
                cout << sub << " IS A NUMBER\n";
            else if (validIdentifier(sub))
                cout << sub << " IS A VALID IDENTIFIER\n";
            else
                cout << sub << " IS NOT A VALID IDENTIFIER\n";
            free(sub);
            left = right;
        }
        else if (right < len)
        {
            // A single punctuator: report it only if it is an operator.
            if (isOperator(str[right]))
                cout << str[right] << " IS AN OPERATOR\n";
            right++;
            left = right;
        }
        else
        {
            break; // end of input with nothing left to classify
        }
    }
}

int main()
{
    char c[100] = "int m = n + 3p";
    // Reports int as a keyword, m and n as valid identifiers,
    // = and + as operators, and 3p as not a valid identifier.
    parse(c);
    return 0;
}
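One limitation worth noting: the program above treats every operator as a single character, so an input such as == is reported as two separate = operators. The sketch below shows one way this could be handled, by trying the two-character operators first and falling back to a single character. The function matchOperator and its operator tables are hypothetical and are not part of the program above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: return the operator that starts at src[pos],
// preferring two-character operators; return "" if none starts there.
std::string matchOperator(const std::string& src, std::size_t pos)
{
    // Assumed operator tables, chosen for illustration only.
    static const char* const twoChar[] = {
        "==", "!=", "<=", ">=", "&&", "||", "<<", ">>", "++", "--"
    };
    static const std::string oneChar = "+-*/<>=&|!";

    if (pos >= src.size()) return "";

    // Try the longer operators first so that "<=" is not split into "<" and "=".
    if (pos + 1 < src.size()) {
        for (const char* op : twoChar) {
            if (src.compare(pos, 2, op) == 0) return op;
        }
    }
    if (oneChar.find(src[pos]) != std::string::npos) {
        return std::string(1, src[pos]);
    }
    return "";
}
```

For example, matchOperator("a<=b", 1) returns <= while matchOperator("a<b", 1) returns <. A scanner would then advance by the length of the returned string.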
Conclusion
Our implementation of a lexical analyzer in C++ should be enough to demonstrate how this stage of a compiler actually works. We also explained what compilers and interpreters are and how they differ. We hope this helped you understand lexical analysis in C++ programming.