C# regular expressions (in lexical analysis)

A place to discuss the implementation and style of computer programs.

Moderators: phlip, Moderators General, Prelates

User avatar
Marvin
Posts: 153
Joined: Thu Nov 22, 2007 8:03 am UTC
Location: Croatia
Contact:

C# regular expressions (in lexical analysis)

Postby Marvin » Sun Jan 20, 2008 5:15 pm UTC

i need to know whether a series of characters has potential to match a regular expression, for example:
bbbbbb has potential to match (b+a), while bbbc doesn't

the regex class only tells whether there is a match or there isn't (or a i haven't got it right), and i don't see a (simple) way to do that... so, any ideas?
42
--
If God intended us to program we would be born with serial I/O ports.

User avatar
Cabhan
Posts: 67
Joined: Wed Aug 29, 2007 4:34 am UTC
Location: Boston, MA, USA
Contact:

Re: C# regular expressions (in lexical analysis)

Postby Cabhan » Mon Jan 21, 2008 4:46 am UTC

Excluding C# specifically, your question makes no sense to me. For instance, in your examples, neither of those matches.

Maybe you can explain what you mean by potential? I mean, a string either matches a regular expression or it doesn't. There's really not any grey area here.

User avatar
Marvin
Posts: 153
Joined: Thu Nov 22, 2007 8:03 am UTC
Location: Croatia
Contact:

Re: C# regular expressions (in lexical analysis)

Postby Marvin » Mon Jan 21, 2008 5:02 am UTC

Cabhan wrote:Excluding C# specifically, your question makes no sense to me. For instance, in your examples, neither of those matches.

Maybe you can explain what you mean by potential? I mean, a string either matches a regular expression or it doesn't. There's really not any grey area here.


Marvin wrote:bbbbbb has potential to match (b+a), while bbbc doesn't


if you add a at the end of bbbbbb, to make bbbbbba, then you have a match, so it has potential for match, while if you add anything at the end of bbbc, there is no possibility for match with (b+a)

thats what i mean by potential
42
--
If God intended us to program we would be born with serial I/O ports.

User avatar
Cabhan
Posts: 67
Joined: Wed Aug 29, 2007 4:34 am UTC
Location: Boston, MA, USA
Contact:

Re: C# regular expressions (in lexical analysis)

Postby Cabhan » Mon Jan 21, 2008 5:38 am UTC

Alrighty. So after some consultation with my roommate, we have decided that this is very definitely possible. However, we are not so certain on whether or not its implementation exists (having said that, I have never used C#, so I'm not exactly an authority on that domain). I found someone who asked the same question in 2005, but he never got an answer, and left no e-mail address.

If you can't find an implementation, the basic outline would be:

1) Translate the regular expression into its equivalent NFA
2) Translate the NFA into its equivalent DFA
3) Treat the DFA as a graph, and basically try to see if you can reach any accept state from the point at which your regular expression matches so far

The first 2 steps are rather simple, and many implementations of this process exist. And at that point, the 3rd shouldn't be too tough. However, having never tried to code such a thing, I couldn't speak on it personally. Also, depending on the complexity of the regular expression, this could be horribly inefficient.

If you find a solution, I'd be very interested in it.

User avatar
Marvin
Posts: 153
Joined: Thu Nov 22, 2007 8:03 am UTC
Location: Croatia
Contact:

Re: C# regular expressions (in lexical analysis)

Postby Marvin » Mon Jan 21, 2008 5:55 am UTC

Cabhan wrote:Alrighty. So after some consultation with my roommate, we have decided that this is very definitely possible. However, we are not so certain on whether or not its implementation exists (having said that, I have never used C#, so I'm not exactly an authority on that domain). I found someone who asked the same question in 2005, but he never got an answer, and left no e-mail address.

If you can't find an implementation, the basic outline would be:

1) Translate the regular expression into its equivalent NFA
2) Translate the NFA into its equivalent DFA
3) Treat the DFA as a graph, and basically try to see if you can reach any accept state from the point at which your regular expression matches so far

The first 2 steps are rather simple, and many implementations of this process exist. And at that point, the 3rd shouldn't be too tough. However, having never tried to code such a thing, I couldn't speak on it personally. Also, depending on the complexity of the regular expression, this could be horribly inefficient.

If you find a solution, I'd be very interested in it.

actually, it's not even that complicated, you make NFA
define 4 types of states, start, end, normal and sink, where sink state is any that cannot lead to end state
and just feed the NFA with characters, and when your NFA comes into sink state then you have no more potential, and while your NFA is in normal state you have potential, until it reaches end state or sink state

now that you think, "so where the hell is the problem", it's converting regular expression into epsilon-NFA and simulating it... it's possible, it's not even that complicated, but i'd like to avoid it... and i prefer not to reinvent the wheel...
42
--
If God intended us to program we would be born with serial I/O ports.

EvanED
Posts: 4331
Joined: Mon Aug 07, 2006 6:28 am UTC
Location: Madison, WI
Contact:

Re: C# regular expressions (in lexical analysis)

Postby EvanED » Mon Jan 21, 2008 6:03 am UTC

Cabhan wrote:Alrighty. So after some consultation with my roommate, we have decided that this is very definitely possible. However, we are not so certain on whether or not its implementation exists (having said that, I have never used C#, so I'm not exactly an authority on that domain). I found someone who asked the same question in 2005, but he never got an answer, and left no e-mail address.

If you can't find an implementation, the basic outline would be:

1) Translate the regular expression into its equivalent NFA
2) Translate the NFA into its equivalent DFA
3) Treat the DFA as a graph, and basically try to see if you can reach any accept state from the point at which your regular expression matches so far

The first 2 steps are rather simple, and many implementations of this process exist. And at that point, the 3rd shouldn't be too tough.

Under the typical construction of NFAs from an RE, IIRC, it will be possible to reach an accepting state from any state in the NFA. So if you were to follow the NFA's execution, as long as you are never in the situation where you are currently at no state, the read portion of the input is a prefix of a matching string.

In that sense every state in your NFA is accepting.

But anyway, that's exactly what I'd do if I couldn't find anything by searching "regular expression prefix" and similar things.

Marvin wrote:define 4 types of states, start, end, normal and sink, where sink state is any that cannot lead to end state
and just feed the NFA with characters, and when your NFA comes into sink state then you have no more potential, and while your NFA is in normal state you have potential, until it reaches end state or sink state

now that you think, "so where the hell is the problem", it's converting regular expression into epsilon-NFA and simulating it... it's possible, it's not even that complicated, but i'd like to avoid it... and i prefer not to reinvent the wheel...

I think you said what he just said, maybe not doing the NFA -> DFA translation. ;-)

User avatar
Marvin
Posts: 153
Joined: Thu Nov 22, 2007 8:03 am UTC
Location: Croatia
Contact:

Re: C# regular expressions (in lexical analysis)

Postby Marvin » Mon Jan 21, 2008 6:30 am UTC

EvanED wrote:Under the typical construction of NFAs from an RE, IIRC, it will be possible to reach an accepting state from any state in the NFA. So if you were to follow the NFA's execution, as long as you are never in the situation where you are currently at no state, the read portion of the input is a prefix of a matching string.

In that sense every state in your NFA is accepting.

hmmm... well... yes and no, you can reach accepting state from any state of NFA, but for sake of programming you define sink state as state where your NFA goes when there is no state, easier to implement IMO (or at least that was my idea of implementing it), and not every state is accepting, only when you come into final state, then is character sequence from input accepted, but we could talk of 2 levels of acceptance here, because i'm interested in both, whether there is potential, and whether the potential is reached (final state).

EvanED wrote:I think you said what he just said, maybe not doing the NFA -> DFA translation. ;-)


well yes... it's true :mrgreen:
42
--
If God intended us to program we would be born with serial I/O ports.

Rysto
Posts: 1460
Joined: Wed Mar 21, 2007 4:07 am UTC

Re: C# regular expressions (in lexical analysis)

Postby Rysto » Mon Jan 21, 2008 8:02 am UTC

All that you have to start at the right of your regular expression and start removing subexpressions. For your example, you want:

a+b|a+|

EvanED
Posts: 4331
Joined: Mon Aug 07, 2006 6:28 am UTC
Location: Madison, WI
Contact:

Re: C# regular expressions (in lexical analysis)

Postby EvanED » Mon Jan 21, 2008 8:13 am UTC

I bet you could come up with something without too much trouble, but it's a little more complicated then just saying take a prefix of the RE. For instance, in (ab+)+c, I think the following covers all matching prefixes:

Code: Select all

 (ab+)+c   |
 (ab+)+    |
 (ab+)*a
 <empty>

I don't think you could have a smaller set. Alternately, you could replace the second to last one above with

Code: Select all

(ab+)+a    |
 a

and have something an automatic process may be more likely to construct.

If you are interested, then working out exactly what the rules should be would be interesting.

For instance, I'm thinking something like this if you can get past my notation:

Code: Select all

Prefixes( re1 | re2 ) = Prefixes(re1) | Prefixes(re2)
Prefixes( re1 re2 )   = Prefixes(re1) | re1 (Prefixes(re2))
Prefixes( (re1)* )    = (re1)* (Prefixes(re1))
Prefixes( (re1)+ )    = Prefixes( (re1)* )

but verifying this to the point where I would trust it the same as I would the previous solution with the NFA would take some work that at least I am not willing to put in now. ;-)

User avatar
Marvin
Posts: 153
Joined: Thu Nov 22, 2007 8:03 am UTC
Location: Croatia
Contact:

Re: C# regular expressions (in lexical analysis)

Postby Marvin » Mon Jan 21, 2008 10:40 am UTC

i think i'll just go with reinventing the wheel...
42
--
If God intended us to program we would be born with serial I/O ports.

User avatar
Cabhan
Posts: 67
Joined: Wed Aug 29, 2007 4:34 am UTC
Location: Boston, MA, USA
Contact:

Re: C# regular expressions (in lexical analysis)

Postby Cabhan » Mon Jan 21, 2008 5:06 pm UTC

EvanED wrote:Under the typical construction of NFAs from an RE, IIRC, it will be possible to reach an accepting state from any state in the NFA. So if you were to follow the NFA's execution, as long as you are never in the situation where you are currently at no state, the read portion of the input is a prefix of a matching string.

In that sense every state in your NFA is accepting.

But anyway, that's exactly what I'd do if I couldn't find anything by searching "regular expression prefix" and similar things.


Right. Well, it depends on how you look at it. I was thinking more along the lines of DFAs at that point, so you could end up in a point where you can't reach the accept state.

However, rethinking it with specifically NFAs (I don't know why I wanted to use DFAs earlier), you are basically modifying an NFA s.t. you have the following: (E, Q, q0, d, R), where R is the set of rejecting states. Then, after traversing your NFA, if you are in a state in R, then you do not have potential to match. Otherwise, you do (either you already match, or you could match).

Marvin, I like the sink state idea: basically, when there is no transition, go to sink state. Sink state rejects, accept states accept, and any other situation means potential.

Fun!

User avatar
Marvin
Posts: 153
Joined: Thu Nov 22, 2007 8:03 am UTC
Location: Croatia
Contact:

Re: C# regular expressions (in lexical analysis)

Postby Marvin » Mon Jan 21, 2008 5:43 pm UTC

Cabhan wrote:Marvin, I like the sink state idea: basically, when there is no transition, go to sink state. Sink state rejects, accept states accept, and any other situation means potential.

Fun!

i have never thought i'd ever say that... but yeah... automata can be fun :D
(i'll have to post more often here, talking with someone about it makes me think more clearly about the problem (or just makes me think :D))

EDIT: the first part of reinventing the wheel is done, the NFA, now to the second, trickier part, turning regular expressions into NFA...
42
--
If God intended us to program we would be born with serial I/O ports.


Return to “Coding”

Who is online

Users browsing this forum: No registered users and 6 guests