Jump to content


Win32 reentrant LEX / FLEX ?


5 replies to this topic

#1 Wernaeh

    Senior Member

  • Members
  • PipPipPipPip
  • 368 posts

Posted 26 April 2009 - 03:13 PM

Hello everyone :)

Currently, I'm tasked with updating an old company project. In particular, I am to update it so it conforms to several coding standards.

One major component of said project consists of various different lexical analyzers for various file types (for a C-like scripting language, various data formats, ...)

All of these up to now have been handwritten in readable, clear, and simple C++ - a state class with a per-character link array that specifies consequent states of the language automaton, and a manually written method that creates all states and links. The resulting lexers don't do too much per char, an if, a table lookup, and a few integer adds, although states are randomly distributed (newed) within memory.

Now, the goal was to replace these lexer classes with a lex/flex generated lexer, mainly for easier maintenance (standards...), with a second goal of later on adding a yacc/bison parser atop for several special input files, and a third goal of vastly improved performance.

However, there's a few problems, and I'm unsure on how to deal with these:

- The lexers need to be reentrant, since they are used from multiple different threads. However, FLEX doesn't support building reentrant lexers on the most recent build for win32. I've tried to compile an uptodate FLEX. Yet, newer version require *nix fork() and wait(), so porting would also be necessary. This seems like an uneccesary effort - if it doesn't work out of the box, it obviously isn't a standard step in applications building anymore.

- Using an Unix version of FLEX also is out of question, since it doesn't integrate into the existing Win32 build process too well.

- I'm uncertain whether a FLEX lexer really brings any performance benefit over the handwritten one - after all, FLEX programs generally do look really messy, somehow, and just from looking at the core loop, I think it has much more operations than the simple lexer currently in place. The same goes for the reentrant option within FLEX: The current C++ lexers obviously are reentrant, since properly capsuled up within their classes, but it's kinda hard to trust FLEX lexers there.

- Are there Win32 alternatives to FLEX that also syntactically are compatible to normal LEX ? Or use a FLEX within cygwin maybe ?

- Or should I just leave the old lexer class in place and somehow try to merge it with a Bison parser later on ? Seems like the most clean solution - but it isn't _standard_ (god do I hate this term right now... :) )

What would you do ?
Any further ideas / pointers / input ?

Thank you for your time,
Cheers,
- Wernaeh
Some call me mathematician, some just call me computer guy. Yet, I prefer the term professional weirdo :)

#2 .oisyn

    DevMaster Staff

  • Moderators
  • 1822 posts

Posted 26 April 2009 - 03:30 PM

Flex C++ is reentrant as you say, so I don't really understand your question. Is it that you don't want to use flex or C++?

Also, does it need to be lex or flex? Because there are tonnes of alternatives available :P
C++ addict
-
Currently working on: the 3D engine for Tomb Raider.

#3 Wernaeh

    Senior Member

  • Members
  • PipPipPipPip
  • 368 posts

Posted 26 April 2009 - 04:11 PM

A okay - got half the problem solved :)
I'm now going with Cygwin, there is an up-to-date flex exe available there, and it still integrates into our build process quite nicely.

Quote

Flex C++ is reentrant as you say
Yup, if you make it compile a C++ lexer it is.

If you use FLEX to compile a C lexer, it isn't - unless you have the option -R or %option reentrant, which is available within FLEX v2.5.35, but not with the old version v2.5.4. Sadly, the latter one is the most recent one available directly for windows within the gnu32 project. I just found out newer versions are available through Cygwin, though. Just a little bit more stuff to download, but they work :)

Quote

Is it that you don't want to use flex or C++?

I officially should use FLEX, that's the first idea that came to our mind when it came to standards for the lexer components.

However, I'm actually rather fond of the previous hand written C++ lexers (because you actually understand what the code is doing at a single glance...)

I also am fond of the FLEX input files (nice and clean :) ), but not so fond of the FLEX C output (lots of macros, gives a series of warnings with MSVC 2003, ...)

I don't want to use the FLEX C++ output, for reasons similar to the ones detailed i
Some call me mathematician, some just call me computer guy. Yet, I prefer the term professional weirdo :)

#4 Wernaeh

    Senior Member

  • Members
  • PipPipPipPip
  • 368 posts

Posted 26 April 2009 - 09:06 PM

Dammit the forum gnoll ate half of my post editing...

Well, anyways, here is the sequel to the text above:

I don't want to use the FLEX C++ output, for reasons similar to the ones detailed in this article.
After all, the comment

Quote

The c++ scanner is a mess. The FlexLexer.h header file relies on the
following macro. This is required in order to pass the
c++-multiple-scanners test in the regression suite. We get reports
that it breaks inheritance. We will address this in a future release
of flex, or omit the C++ scanner altogether.
within the C++ scanner generated by flex does not exactly earn my trust.

However, I'm basically open to alternative ideas, as long as there is some reasoning behind them - thus, I'm currently just collecting argument points for either a FLEX based parser, our old, handwritten parser, or an alternative solution.

Requirements are:
+ Scanner and scanner generator integrate into MS VC 2003
+ Reentrant behaviour
+ License that allows the scanner to be integrated into closed-source software

Thank you for the comparison list you posted - I didn't even know there were such things on the wikipedia... I'm currently looking into RE2C and Quelex, both loook quite promising.

Cheers,
- Wernaeh
Some call me mathematician, some just call me computer guy. Yet, I prefer the term professional weirdo :)

#5 SamuraiCrow

    Senior Member

  • Members
  • PipPipPipPip
  • 459 posts

Posted 27 April 2009 - 01:52 AM

If you want readable C++ parser code, I'd consider using YARD. The rules for using a Parsing Expression Grammar (PEG) as opposed to Bakkus-Naur Format (BNF) are quite compelling.

First of all, PEGs are based on regular expressions and allow all sorts of nice repeat-states and optional states. Secondly, PEGs are never ambiguous since they are based on a greedy algorithm (rules are prioritized first to last). Thirdly, if you ever need to add UNICODE support to your parser, you can forget about LEX and YACC and their derivatives completely due to memory constraints of the arrays having to have more than 127 ASCII codes of column width.

The downsides of using YARD are as follows: The greedy algorithms of some states degenerate into a linear search as opposed to an array lookup in FLEX having constant time. Since the YARD parser generator is based on templates, the number of parameters you can pass to each rule is limited to about 10 although this is easy to work around.

If you decide to use YARD (or any other PEG parser generator) and need help, just let me know or post to the forums on the YARD website above.

-edit-
I forgot to mention you don't need a separate scanner and parser in a PEG.

#6 .oisyn

    DevMaster Staff

  • Moderators
  • 1822 posts

Posted 27 April 2009 - 11:45 AM

I'm currently using Coco/R for implementing my scripting language in C++. Aside from the fact that it's a more traditional parser generator that transforms a formal definition into a sourcefile, I think it's very comparable to YARD. It also generates a recursive descent parser, and it uses an EBNF notation for repeats and optional sequences. You also define the scanner and the parser in a single file.

The only downside is that the C++ generator makes no use of the STL whatsoever. You can't feed it an istream, only a FILE* or a char*. Not even a wchar_t*, even though it uses wchar_t strings internally. Fortunately, it's quite easy to create your own templates so this is not unsolvable.
C++ addict
-
Currently working on: the 3D engine for Tomb Raider.





1 user(s) are reading this topic

0 members, 1 guests, 0 anonymous users