From regular expressions to deterministic automata

Gerard Berry, Ravi Sethi

Research output: Contribution to journal › Article

173 Citations (Scopus)

Abstract

The main theorem allows an elegant algorithm to be refined into an efficient one. The elegant algorithm for constructing a finite automaton from a regular expression is based on 'derivatives of' regular expressions; the efficient algorithm is based on 'marking of' regular expressions. Derivatives of regular expressions correspond to state transitions in finite automata. When a finite automaton makes a transition under input symbol a, a leading a is stripped from the remaining input. Correspondingly, if the input string is generated by a regular expression E, then the derivative of E by a generates the remaining input after a leading a is stripped. Brzozowski (1964) used derivatives to construct finite automata; the state for expression E has a transition under a to the state for the derivative of E by a. This approach extends to regular expressions with new operators, including intersection and complement; however, explicit computation of derivatives can be expensive. Marking of regular expressions yields an expression with distinct input symbols. Following McNaughton and Yamada (1960), we attach subscripts to each input symbol in an expression; (ab + b)*ba becomes (a₁b₂ + b₃)*b₄a₅. Conceptually, the efficient algorithm constructs an automaton for the marked expression. The marks on the transitions are then erased, resulting in a nondeterministic automaton for the original unmarked expression. This approach works for the usual operations of union, concatenation, and iteration; however, intersection and complement cannot be handled because marking and unmarking do not preserve the languages generated by regular expressions with these operators.
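The two ideas in the abstract, derivatives and marking, can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the tuple encoding of expressions and the helper names (sym, alt, cat, star, nullable, deriv, matches, mark) are assumptions made for this example. The sketch covers only derivative-based matching in the style of Brzozowski and the subscripting step; the efficient automaton construction is not reproduced.

```python
from itertools import count

# Regular expressions as nested tuples (an encoding assumed for this sketch):
#   ("empty",)  no strings        ("eps",)  only the empty string
#   ("sym", a)  ("alt", r, s)     ("cat", r, s)     ("star", r)
EMPTY = ("empty",)
EPS = ("eps",)

def sym(a):
    return ("sym", a)

def alt(r, s):
    # Union, lightly simplified so repeated derivatives stay small.
    if r == EMPTY:
        return s
    if s == EMPTY or r == s:
        return r
    return ("alt", r, s)

def cat(r, s):
    # Concatenation, lightly simplified.
    if r == EMPTY or s == EMPTY:
        return EMPTY
    if r == EPS:
        return s
    if s == EPS:
        return r
    return ("cat", r, s)

def star(r):
    # Iteration, lightly simplified.
    if r == EMPTY or r == EPS:
        return EPS
    return r if r[0] == "star" else ("star", r)

def nullable(r):
    """True if r generates the empty string."""
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("empty", "sym"):
        return False
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])  # cat

def deriv(r, a):
    """Derivative of r by a: what remains after a leading a is stripped."""
    tag = r[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "sym":
        return EPS if r[1] == a else EMPTY
    if tag == "alt":
        return alt(deriv(r[1], a), deriv(r[2], a))
    if tag == "star":
        return cat(deriv(r[1], a), r)
    left = cat(deriv(r[1], a), r[2])  # cat
    return alt(left, deriv(r[2], a)) if nullable(r[1]) else left

def matches(r, s):
    """Each derivative step plays the role of one state transition."""
    for a in s:
        r = deriv(r, a)
    return nullable(r)

def mark(r, counter):
    """Attach a distinct subscript to each symbol occurrence, left to right."""
    tag = r[0]
    if tag == "sym":
        return ("sym", (r[1], next(counter)))
    if tag in ("empty", "eps"):
        return r
    if tag == "star":
        return ("star", mark(r[1], counter))
    left = mark(r[1], counter)
    right = mark(r[2], counter)
    return (tag, left, right)

# The abstract's example expression (ab + b)*ba.
E = cat(star(alt(cat(sym("a"), sym("b")), sym("b"))), cat(sym("b"), sym("a")))
MARKED = mark(E, count(1))  # symbols become ("a", 1), ("b", 2), ("b", 3), ...
```

Under these assumptions, matches(E, "abba") returns True and matches(E, "ab") returns False, while MARKED carries the subscripts 1 to 5 in the same left-to-right order as the abstract's marked expression.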

Original language: English (US)
Pages (from-to): 117-126
Number of pages: 10
Journal: Theoretical Computer Science
Volume: 48
Issue number: C
DOI: 10.1016/0304-3975(86)90088-5
State: Published - 1986
Externally published: Yes

Fingerprint

Regular Expressions
Automata
Finite automata
Derivatives
Finite Automata
Derivative
Efficient Algorithms
Complement
Intersection
Subscript
Concatenation
State Transition
Operator
Union
Strings
Distinct
Iteration
Theorem

ASJC Scopus subject areas

  • Computational Theory and Mathematics

Cite this

From regular expressions to deterministic automata. / Berry, Gerard; Sethi, Ravi.

In: Theoretical Computer Science, Vol. 48, No. C, 1986, p. 117-126.

@article{9fdb9abff44e4a5e800eadc4c0eb06b2,
title = "From regular expressions to deterministic automata",
abstract = "The main theorem allows an elegant algorithm to be refined into an efficient one. The elegant algorithm for constructing a finite automaton from a regular expression is based on 'derivatives of' regular expressions; the efficient algorithm is based on 'marking of' regular expressions. Derivatives of regular expressions correspond to state transitions in finite automata. When a finite automaton makes a transition under input symbol a, a leading a is stripped from the remaining input. Correspondingly, if the input string is generated by a regular expression E, then the derivative of E by a generates the remaining input after a leading a is stripped. Brzozowski (1964) used derivatives to construct finite automata; the state for expression E has a transition under a to the state for the derivative of E by a. This approach extends to regular expressions with new operators, including intersection and complement; however, explicit computation of derivatives can be expensive. Marking of regular expressions yields an expression with distinct input symbols. Following McNaughton and Yamada (1960), we attach subscripts to each input symbol in an expression; $(ab + b)^* ba$ becomes $(a_1 b_2 + b_3)^* b_4 a_5$. Conceptually, the efficient algorithm constructs an automaton for the marked expression. The marks on the transitions are then erased, resulting in a nondeterministic automaton for the original unmarked expression. This approach works for the usual operations of union, concatenation, and iteration; however, intersection and complement cannot be handled because marking and unmarking do not preserve the languages generated by regular expressions with these operators.",
author = "Gerard Berry and Ravi Sethi",
year = "1986",
doi = "10.1016/0304-3975(86)90088-5",
language = "English (US)",
volume = "48",
pages = "117--126",
journal = "Theoretical Computer Science",
issn = "0304-3975",
publisher = "Elsevier",
number = "C",

}

TY - JOUR
T1 - From regular expressions to deterministic automata
AU - Berry, Gerard
AU - Sethi, Ravi
PY - 1986
Y1 - 1986
AB - The main theorem allows an elegant algorithm to be refined into an efficient one. The elegant algorithm for constructing a finite automaton from a regular expression is based on 'derivatives of' regular expressions; the efficient algorithm is based on 'marking of' regular expressions. Derivatives of regular expressions correspond to state transitions in finite automata. When a finite automaton makes a transition under input symbol a, a leading a is stripped from the remaining input. Correspondingly, if the input string is generated by a regular expression E, then the derivative of E by a generates the remaining input after a leading a is stripped. Brzozowski (1964) used derivatives to construct finite automata; the state for expression E has a transition under a to the state for the derivative of E by a. This approach extends to regular expressions with new operators, including intersection and complement; however, explicit computation of derivatives can be expensive. Marking of regular expressions yields an expression with distinct input symbols. Following McNaughton and Yamada (1960), we attach subscripts to each input symbol in an expression; (ab + b)*ba becomes (a₁b₂ + b₃)*b₄a₅. Conceptually, the efficient algorithm constructs an automaton for the marked expression. The marks on the transitions are then erased, resulting in a nondeterministic automaton for the original unmarked expression. This approach works for the usual operations of union, concatenation, and iteration; however, intersection and complement cannot be handled because marking and unmarking do not preserve the languages generated by regular expressions with these operators.
UR - http://www.scopus.com/inward/record.url?scp=0022989344&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0022989344&partnerID=8YFLogxK
U2 - 10.1016/0304-3975(86)90088-5
DO - 10.1016/0304-3975(86)90088-5
M3 - Article
VL - 48
SP - 117
EP - 126
JO - Theoretical Computer Science
JF - Theoretical Computer Science
SN - 0304-3975
IS - C
ER -