PCRE

Matching Patterns

@ThomasWeinert

Tasks

github.com/ThomasWeinert/workshop-pcre/tree/master/tasks

Matching: PHP Functions

preg_match
preg_match_all

preg_match

Find first match

preg_match($pattern, $subject);
preg_match($pattern, $subject, $matches);
preg_match($pattern, $subject, $matches, $flags, $offset);

preg_match - return values

Match count - 0 or 1
FALSE for errors
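
A minimal sketch of the three return values. The patterns and subjects are made up for illustration and are not part of the workshop tasks.

```php
<?php
// Illustrative pattern: one or more digits.
$pattern = '/\d+/';

// 1: preg_match() stops after the first match, "456" is never looked at.
$found = preg_match($pattern, 'abc 123 def 456', $matches);

// 0: the pattern is valid but nothing in the subject matches.
$notFound = preg_match($pattern, 'no digits here');

// FALSE: the pattern does not compile (unclosed group).
// The warning is suppressed with @ to keep the output clean.
$error = @preg_match('/(/', '123');

var_dump($found, $notFound, $error, $matches[0]);
```

Compare the result with === so that 0 (no match) and FALSE (error) are not confused.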

preg_match_all

Find all matches

preg_match_all($pattern, $subject);
preg_match_all($pattern, $subject, $matches);
preg_match_all($pattern, $subject, $matches, $flags, $offset);

preg_match_all - return values

Match count - 0 to n
FALSE for errors

$matches

preg_match() - array, matched groups
preg_match_all() - array with
- PREG_PATTERN_ORDER - an array for each group in the pattern
- PREG_SET_ORDER - an array for each match
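
The two orderings can be sketched side by side. Pattern and subject are assumptions chosen only to make the array shapes visible.

```php
<?php
// Two groups: a key (word characters) and a value (digits).
$pattern = '/(\w+)=(\d+)/';
$subject = 'a=1 b=2';

// PREG_PATTERN_ORDER (the default): one array per group.
preg_match_all($pattern, $subject, $byGroup, PREG_PATTERN_ORDER);
// $byGroup[0] holds the full matches, $byGroup[1] all keys, $byGroup[2] all values.

// PREG_SET_ORDER: one array per match.
preg_match_all($pattern, $subject, $byMatch, PREG_SET_ORDER);
// $byMatch[0] holds everything for "a=1", $byMatch[1] everything for "b=2".

print_r($byGroup);
print_r($byMatch);
```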

Pattern Argument

Task: Match a string

Match the string "nevercodealone". This is case sensitive.

Try different delimiters

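Any ASCII character that is not a letter, a digit, a backslash or whitespace
Letters and digits do NOT work.
Brackets are used as pairs: (), [], {}, <>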
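
A short sketch of the same pattern written with different delimiters. The subject is an assumption for illustration.

```php
<?php
$subject = 'path/to/file';

// The delimiter has to be escaped when it appears inside the pattern.
$slash  = preg_match('/to\/file/', $subject);

// Other delimiters avoid the escaping.
$hash   = preg_match('#to/file#', $subject);
$tilde  = preg_match('~to/file~', $subject);

// Bracket delimiters are written as an opening and a closing character.
$parens = preg_match('(to/file)', $subject);
$braces = preg_match('{to/file}', $subject);

// A letter as delimiter is rejected: warning (suppressed here) and FALSE.
$invalid = @preg_match('atoa', $subject);

var_dump($slash, $hash, $tilde, $parens, $braces, $invalid);
```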

Modifier

U - ungreedy mode
i - case insensitive
u - utf-8 mode
x - modifies whitespace behaviour
s - modifies dot behaviour
m - modifies anchor behaviour
D - modifies behaviour of $ anchor
...
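
A few of the modifiers above in action, as a rough sketch. Patterns and subjects are invented for illustration.

```php
<?php
// U: quantifiers become ungreedy and stop at the first possible end.
preg_match('/<.+>/', '<a><b>', $greedy);     // matches '<a><b>'
preg_match('/<.+>/U', '<a><b>', $ungreedy);  // matches '<a>'

// m: ^ and $ match at line starts and line ends, not only at the subject borders.
$withoutM = preg_match('/^b$/', "a\nb");
$withM    = preg_match('/^b$/m', "a\nb");

// x: whitespace in the pattern is ignored, so the pattern can be formatted.
$withX = preg_match('/ a b c /x', 'abc');

var_dump($greedy[0], $ungreedy[0], $withoutM, $withM, $withX);
```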

Task: Match a string case insensitive

The modifier i allows case insensitive matches

Match the string "code". This is case insensitive.

The Dot

Matches anything except a newline
Matches anything if modifier "s" is set
Escape . with \ to match an actual .
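
The three rules for the dot, sketched with made-up subjects:

```php
<?php
$subject = "a\nb";

// Without s the dot does not match the newline between a and b.
$default = preg_match('/a.b/', $subject);

// With s the dot matches the newline as well.
$dotAll = preg_match('/a.b/s', $subject);

// Unescaped, the dot accepts any character; escaped, only an actual dot.
$unescaped = preg_match('/1.5/', '1x5');
$escaped   = preg_match('/1\.5/', '1x5');

var_dump($default, $dotAll, $unescaped, $escaped);
```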

Task: Match anything but newlines

Match a string with the structure cc.cc.cc.cc. "c" can be any character except a newline.

Qualifier

What will be matched?

Define bytes/characters that will be matched.

Task: Match digits and non-digits

The qualifier \d matches any digit (0-9).

The qualifier \D matches anything except a digit.

Match a string with the structure xxXxxXxxxx. "x" represents a digit, "X" a non-digit.
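
To see \d and \D side by side without solving the task, here is a small sketch that counts both kinds of characters. The subject is an assumption.

```php
<?php
$subject = 'a1b22c';

// preg_match_all() returns the number of matches.
$digits    = preg_match_all('/\d/', $subject); // 1, 2, 2
$nonDigits = preg_match_all('/\D/', $subject); // a, b, c

var_dump($digits, $nonDigits);
```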

Anchors

Anchor your pattern to the start and/or end of the subject.

^ - string start
$ - string end
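
A minimal sketch of both anchors, using letters so the following tasks stay open. Subjects are made up.

```php
<?php
// ^ ties the pattern to the start of the subject.
$atStart    = preg_match('/^abc/', 'abcdef');
$notAtStart = preg_match('/^abc/', 'xabcdef');

// $ ties the pattern to the end of the subject.
$atEnd = preg_match('/def$/', 'abcdef');

// Both anchors together: the whole subject has to match.
$whole = preg_match('/^abc$/', 'abcdef');

var_dump($atStart, $notAtStart, $atEnd, $whole);
```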

Task: Validate string start

The ^ anchors the pattern to the string start.

Validate that the string starts with a digit.

Task: Validate string end

The $ anchors the pattern to the string end.

Validate that the string ends with a digit.

Task: Validate a German zip code

The modifier D makes sure that $ matches only at the very end of the subject. Without it, $ also matches before a linefeed at the end of the subject. Validate that the subject is a German zip code. It consists of 5 digits.

$pattern = '(^\d\d\d\d\d$)D';
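
The effect of D can be checked with a subject that ends in a linefeed. This is a sketch; the subjects are assumptions.

```php
<?php
// The pattern from the slide, once without and once with the D modifier.
$withoutD = '(^\d\d\d\d\d$)';
$withD    = '(^\d\d\d\d\d$)D';

// Without D, $ also matches before a linefeed at the end of the subject.
$loose = preg_match($withoutD, "12345\n");

// With D the trailing linefeed makes the validation fail.
$strict = preg_match($withD, "12345\n");

// A clean zip code passes in both cases.
$clean = preg_match($withD, '12345');

var_dump($loose, $strict, $clean);
```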

Modifier and Alternatives

Modifier m - line anchors
\A - string start
\Z - string end, ignore linefeed
\z - string end, recognize linefeed
\b - word boundary
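
A sketch that contrasts the anchors listed above. The subjects are invented for illustration.

```php
<?php
$subject = "first\nsecond\n";

// With m, ^ matches at every line start, \A still only at the subject start.
$lineStart   = preg_match('/^second/m', $subject);
$stringStart = preg_match('/\Asecond/m', $subject);

// \Z accepts a linefeed at the end of the subject, \z does not.
$upperZ = preg_match('/second\Z/', $subject);
$lowerZ = preg_match('/second\z/', $subject);

// \b matches between a word character and a non-word character.
$wholeWord  = preg_match('/\bcode\b/', 'never code alone');
$insideWord = preg_match('/\bcode\b/', 'nevercodealone');

var_dump($lineStart, $stringStart, $upperZ, $lowerZ, $wholeWord, $insideWord);
```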

Character Classes

Square Brackets: []
- (hyphen) for ranges, e.g. [a-f]
^ at the start for negative matches, e.g. [^a-f]
many special characters lose their special meaning
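
The three points above as a small sketch. The classes are chosen so they do not give away the following tasks.

```php
<?php
// A range: any single character from a to c.
$inRange  = preg_match('/^[a-c]$/', 'b');
$outRange = preg_match('/^[a-c]$/', 'd');

// A negated class: any single character that is NOT a, b or c.
$negated = preg_match('/^[^a-c]$/', 'd');

// Inside a class . and + are literal characters, not "anything" and "one or more".
$literalDot = preg_match('/^[.+]$/', '.');
$notAnyChar = preg_match('/^[.+]$/', 'x');

var_dump($inRange, $outRange, $negated, $literalDot, $notAnyChar);
```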

Task: Match vowels

Match all the vowels (aeiou) in the string.

Task: Match non-vowels

Match all the non-vowels in the string.

Task: Validate hexadecimal bytes

Validate that the string consists of two characters. Each character can be a digit or a letter between a and f.

Quantifier

How often will it be matched?

* - any count
? - maximum of 1
+ - minimum of 1
{n} - exactly n
{n,m} - minimum of n, maximum of m
{n,} - minimum of n
{0,m} - maximum of m
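
A sketch of the quantifiers applied to the letter b. The helper matches() is hypothetical and exists only to keep the example short.

```php
<?php
// Hypothetical helper: TRUE if the pattern matches the subject.
function matches(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

$star     = matches('(^ab*c$)D', 'ac');         // * allows zero b
$plus     = matches('(^ab+c$)D', 'ac');         // + needs at least one b
$optional = matches('(^ab?c$)D', 'abbc');       // ? allows at most one b
$exact    = matches('(^ab{3}c$)D', 'abbbc');    // exactly three b
$range    = matches('(^ab{2,3}c$)D', 'abbbbc'); // four b are too many
$atLeast  = matches('(^ab{2,}c$)D', 'abbbbc');  // two or more b

var_dump($star, $plus, $optional, $exact, $range, $atLeast);
```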

Task: Validate a German zip code

The {n} syntax allows you to repeat a qualifier a fixed number of times. Validate that the subject is a German zip code. It consists of 5 digits.

$pattern = '(^\\d{5}$)D';

Task: Validate a language code

The {n,m} syntax allows you to set a minimum and a maximum number of repetitions. Validate that the subject is a 2 or 3 letter language code.

Task: Validate an integer

? matches one or none. + matches at least one repetition. Validate an integer including an optional leading sign.

Unicode

The modifier u activates Unicode UTF-8 mode.

\X - extended unicode grapheme sequence
\p{xx}, \px - character with unicode property
\p{^xx}, \P{xx} - character without unicode property
\p{script} - character from script
\x{FFFF} - code point
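
A sketch of the Unicode features listed above. The characters are written as "\u{...}" escapes so the example does not depend on the file encoding; the chosen properties and scripts are assumptions that avoid the following tasks.

```php
<?php
$umlaut = "\u{E4}"; // ä, two bytes in UTF-8

// Without u the dot sees bytes, with u it sees characters.
$byteCount = preg_match_all('/./', $umlaut);
$charCount = preg_match_all('/./u', $umlaut);

// Property Lu: uppercase letter. \P{Lu} and \p{^Lu} both negate it.
$upper    = preg_match('/^\p{Lu}$/u', "\u{C4}"); // Ä
$notUpper = preg_match('/^\P{Lu}$/u', $umlaut);
$negated  = preg_match('/^\p{^Lu}$/u', $umlaut);

// Script name: Greek letters alpha, beta, gamma.
$greek = preg_match('/^\p{Greek}+$/u', "\u{3B1}\u{3B2}\u{3B3}");

// Code point: the euro sign.
$euro = preg_match('/^\x{20AC}$/u', "\u{20AC}");

// \X: "e" plus a combining acute accent is a single grapheme.
$grapheme = preg_match('/^\X$/u', "e\u{301}");

var_dump($byteCount, $charCount, $upper, $notUpper, $negated, $greek, $euro, $grapheme);
```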

Task: Match unicode letters

Use the unicode property L to match any letter in the string "English, Русский, 中文".

Task: Match Cyrillic letters

Match any Cyrillic letter in the subject.

Groups

(...) - captured group
(?<group_name>...) - named group
(?:...) - group without capture
((?i)...), (?i:...) - group modifiers
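
The group types can be sketched like this. The pattern and the subject are assumptions chosen to stay clear of the date task.

```php
<?php
// Two named groups and one group without capture in between.
$pattern = '/(?<user>\w+)@(?:www\.)?(?<host>\w+)/';
preg_match($pattern, 'jane@www.example', $m);

// Named groups appear under their name AND under their index.
// The (?:...) group gets no entry at all.
print_r($m);

// Group modifier: only the b is case insensitive.
$inner = preg_match('/a(?i:b)c/', 'aBc');
$outer = preg_match('/a(?i:b)c/', 'ABc');

var_dump($inner, $outer);
```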

Task: Match date parts

Match a date in the format "YYYY-MM-DD". Capture each part into a named group (year, month, day).

Task: Check for consecutive ughs

Check if the string contains 3 consecutive "ugh"s.

Alternatives

| - alternative patterns
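
A sketch of alternatives with and without a group. The file extensions are made up for illustration.

```php
<?php
// The group limits the alternatives, the anchors apply to all of them.
$grouped = '/^(?:jpg|png|gif)$/';
$known   = preg_match($grouped, 'png');
$unknown = preg_match($grouped, 'bmp');

// Without the group, ^ belongs only to "jpg" and $ only to "gif".
// "png" may then appear anywhere in the subject.
$ungrouped = preg_match('/^jpg|png|gif$/', 'xpngx');

var_dump($known, $unknown, $ungrouped);
```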

Task: Validate title and name

Match strings that start with a title ('Mr.', 'Ms.', 'Mrs.'), followed by a space and a string that contains at least one letter.

Format and comment

Modifier x allows formatting
# - single line comment
(?#...) - comment group
\Q...\E - remove special meaning

Example: Format and comment

$pattern = '(/
  (?:[a-zA-Z\\d_-]+\\.) #title
  (?<mode>media|download|thumb)\\. # mode
  (?:(?<preview>preview)\\.)? # is preview
  (?<media_uri>
    (?<id>[A-Fa-f\\d]{32}) #id
    (?:v(?<version>\\d+))? #version
    (?:\\.[a-zA-Z\\d]+)? #extension
  )
$)Dix';
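
The example above covers the x modifier and # comments. \Q...\E and (?#...) can be sketched separately; the subjects are assumptions.

```php
<?php
// Between \Q and \E the + is a literal plus sign.
$quoted = preg_match('/^\Q1+1=2\E$/', '1+1=2');

// Without \Q...\E the + is a quantifier, so the literal "+" in the subject does not match.
$unquoted = preg_match('/^1+1=2$/', '1+1=2');

// (?#...) works without the x modifier and is ignored by the matcher.
$commented = preg_match('/^\d{4}(?# year)-\d{2}(?# month)$/', '2024-05');

var_dump($quoted, $unquoted, $commented);
```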

Back references

\1, \g{1} - reference group by index
(?P=name), \g{name} - reference group by name
\g{-1} - relative group reference
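
A sketch of the three reference styles. The patterns are invented for illustration and do not solve the following task.

```php
<?php
// By index: find a word that is repeated directly.
$doubled = preg_match('/\b(\w+) \1\b/', 'Check if the the string', $m);

// By name: the closing quote has to be the same as the opening quote.
$quotes   = '/^(?<quote>["\']).*\g{quote}$/';
$sameQuote  = preg_match($quotes, '"text"');
$mixedQuote = preg_match($quotes, '"text\'');

// Relative: \g{-1} refers to the group right before the reference, here (b).
$relative = preg_match('/^(a)(b)\g{-1}$/', 'abb');

var_dump($doubled, $m[1], $sameQuote, $mixedQuote, $relative);
```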

Task: Validate drunken numbers

Validate strings that consist of any number of repetitions of the same digit (11, 444, ...).

Templates

(?(DEFINE)(?<name>...))
(?&name)
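
A minimal sketch of a template, assuming a MAC address as the subject. The group name byte is hypothetical.

```php
<?php
// "byte" is defined once inside (?(DEFINE)...) and reused with (?&byte).
$pattern = '(^
  (?&byte)(?::(?&byte)){5} # six bytes separated by colons
  (?(DEFINE)
    (?<byte>[A-Fa-f\d]{2}) # two hexadecimal digits
  )
$)Dx';

$valid    = preg_match($pattern, '00:1A:2b:3C:4d:5E');
$tooShort = preg_match($pattern, '00:1A:2b:3C:4d');
$notHex   = preg_match($pattern, '00:1A:2b:3C:4d:5G');

var_dump($valid, $tooShort, $notHex);
```

The DEFINE group itself never matches anything, it only provides the named sub-pattern.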

Task: Validate IpV4

Define a template that matches a number between 0 and 255. Use the template to match an IPv4 address.

Pattern: IpV4

$pattern = '(^
  (?:(?&number)\\.){3}(?&number)
  (?(DEFINE)
    (?<number>
      25[0-5]| # 250 - 255
      2[0-4]\\d| # 200 - 249
      1?\\d{1,2} # 0 - 199
    )
  )
$)Dx';

Assertions

(?=...), (?!...) - Lookahead
(?<=...), (?<!...) - Lookbehind
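
A sketch of the four assertions. They check the text before or after the current position without making it part of the match. Patterns and subjects are assumptions.

```php
<?php
// Lookahead: digits followed by "px", the unit is not part of the match.
preg_match('/\d+(?=px)/', 'width: 120px', $ahead);

// Negative lookahead: "foo" that is NOT followed by "bar".
$fooBar = preg_match('/foo(?!bar)/', 'foobar');
$fooBaz = preg_match('/foo(?!bar)/', 'foobaz');

// Lookbehind: digits preceded by "EUR ", the currency is not part of the match.
preg_match('/(?<=EUR )\d+/', 'EUR 42', $behind);

// Negative lookbehind: "happy" that is NOT preceded by "un".
$unhappy = preg_match('/(?<!un)happy/', 'unhappy');
$happy   = preg_match('/(?<!un)happy/', 'happy');

var_dump($ahead[0], $fooBar, $fooBaz, $behind[0], $unhappy, $happy);
```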

Links