PCRE

Matching Patterns

@ThomasWeinert

Tasks

github.com/ThomasWeinert/workshop-pcre/tree/master/tasks

Matching: PHP Functions

  • preg_match
  • preg_match_all

preg_match

Find first match

  • preg_match($pattern, $subject);
  • preg_match($pattern, $subject, $matches);
  • preg_match($pattern, $subject, $matches, $flags, $offset);

preg_match - return values

  • Match count - 0 or 1
  • FALSE for errors
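
A minimal sketch of the three call forms and the return values listed above. The subject string and variable names are made up for illustration.

```php
<?php
// Illustrative subject, not part of the workshop tasks.
$subject = 'Order 42 shipped, order 43 pending';

// First match only: returns 1 and fills $matches.
$found = preg_match('/\d+/', $subject, $matches);
// $found === 1, $matches[0] === '42'

// No match: returns 0 (not false).
$none = preg_match('/\d+/', 'no digits here');

// With PREG_OFFSET_CAPTURE each entry is [text, byte offset];
// the search starts at byte offset 8.
preg_match('/\d+/', $subject, $withOffset, PREG_OFFSET_CAPTURE, 8);
// $withOffset[0] === ['43', 24]

// A broken pattern returns false and raises a warning (silenced here with @).
$error = @preg_match('/(/', $subject);

// Compare with === because 0 and false are both falsy.
var_dump($found === 1, $none === 0, $error === false);
```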

preg_match_all

Find all matches

  • preg_match_all($pattern, $subject);
  • preg_match_all($pattern, $subject, $matches);
  • preg_match_all($pattern, $subject, $matches, $flags, $offset);

preg_match_all - return values

  • Match count - 0 to n
  • FALSE for errors

$matches

  • preg_match() - array, matched groups
  • preg_match_all() - array with
    • PREG_PATTERN_ORDER - an array for each group in the pattern
    • PREG_SET_ORDER - an array for each match
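
The two orderings are easiest to compare side by side. A small sketch with an invented subject; PREG_PATTERN_ORDER is the default when no flag is given.

```php
<?php
$subject = 'a=1, b=2';
$pattern = '/(\w)=(\d)/';

// One array per group: index 0 holds the full matches.
$count = preg_match_all($pattern, $subject, $byGroup, PREG_PATTERN_ORDER);
// $count === 2
// $byGroup === [['a=1', 'b=2'], ['a', 'b'], ['1', '2']]

// One array per match: each holds the full match followed by its groups.
preg_match_all($pattern, $subject, $byMatch, PREG_SET_ORDER);
// $byMatch === [['a=1', 'a', '1'], ['b=2', 'b', '2']]

print_r($byGroup);
print_r($byMatch);
```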

Pattern Argument

Task: Match a string

Match the string "nevercodealone". This is case sensitive.

Try different delimiters

  • Any non-alphanumeric ASCII character
  • Letters, digits, backslash and whitespace do NOT work.
  • Brackets work as pairs: (), [], {}, <>
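
A short sketch of different delimiters around the same pattern. The subject is invented; the last call shows what happens with a letter as delimiter.

```php
<?php
$subject = 'path/to/file';

// Slash delimiters: a literal / inside the pattern must be escaped.
$slash = preg_match('/\/to\//', $subject);

// Other punctuation characters avoid the escaping.
$hash  = preg_match('#/to/#', $subject);
$tilde = preg_match('~/to/~', $subject);

// Bracket pairs: the closing delimiter is the matching closing bracket.
$paren = preg_match('(/to/)', $subject, $m);
$brace = preg_match('{/to/}', $subject);
// The outer () are delimiters, not a capture group, so $m only has index 0.

// A letter as delimiter is rejected: warning (silenced with @) and false.
$letter = @preg_match('x/to/x', $subject);

var_dump($slash, $hash, $tilde, $paren, $brace, $letter);
```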

Modifier

  • U - ungreedy mode
  • i - case insensitive
  • u - utf-8 mode
  • x - modifies whitespace behaviour
  • s - modifies dot behaviour
  • m - modifies anchor behaviour
  • D - modifies behaviour of $ anchor
  • ...
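
Modifiers follow the closing delimiter and can be combined. A minimal sketch for i, U and x with made-up subjects:

```php
<?php
// i: case insensitive
$insensitive = preg_match('/php/i', 'I like PHP');      // 1

// U: quantifiers become ungreedy by default
preg_match('/<.+>/', '<a><b>', $greedy);                // $greedy[0] === '<a><b>'
preg_match('/<.+>/U', '<a><b>', $ungreedy);             // $ungreedy[0] === '<a>'

// x: whitespace inside the pattern is ignored
$spaced = preg_match('/ a b c /x', 'abc');              // 1

// Combined: ignore whitespace and case
$combined = preg_match('/ p h p /ix', 'PHP');           // 1

var_dump($insensitive, $greedy[0], $ungreedy[0], $spaced, $combined);
```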

Task: Match a string case insensitive

The modifier i allows case insensitive matches

Match the string "code". This is case insensitive.

The Dot

  • Matches anything except a newline
  • Matches anything if modifier "s" is set
  • Escape . with \ to match an actual .
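
The three rules for the dot, sketched with invented subjects:

```php
<?php
// The dot matches any character except a newline ...
$any      = preg_match('/a.c/', 'abc');      // 1
$newline  = preg_match('/a.c/', "a\nc");     // 0

// ... unless the modifier s is set.
$dotAll   = preg_match('/a.c/s', "a\nc");    // 1

// An escaped dot only matches a literal dot.
$literal  = preg_match('/a\.c/', 'a.c');     // 1
$noDot    = preg_match('/a\.c/', 'abc');     // 0

var_dump($any, $newline, $dotAll, $literal, $noDot);
```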

Task: Match anything but newlines

Match the string cc.cc.cc.cc. "c" can be any character except a newline.

Qualifier

What will be matched?

Define bytes/characters that will be matched.

Task: Match digits and non-digits

The qualifier \d matches any digit (0-9).

The qualifier \D matches anything except a digit.

Match a string with the structure xxXxxXxxxx. "x" represents a digit, "X" a non-digit.
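
To see what \d and \D pick up, a small sketch that collects both from one invented subject (not a solution to the task):

```php
<?php
$subject = 'a1b22';

// Every single digit in the subject.
preg_match_all('/\d/', $subject, $digits);
// $digits[0] === ['1', '2', '2']

// Everything that is not a digit.
preg_match_all('/\D/', $subject, $others);
// $others[0] === ['a', 'b']

print_r($digits[0]);
print_r($others[0]);
```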

Anchors

Anchor your pattern to the start and/or end of the subject.

  • ^ - string start
  • $ - string end
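
A minimal sketch of both anchors, using invented subjects:

```php
<?php
// ^ pins the pattern to the start of the subject.
$start    = preg_match('/^abc/', 'abcdef');  // 1
$notStart = preg_match('/^abc/', 'xabc');    // 0

// $ pins the pattern to the end of the subject.
$end      = preg_match('/def$/', 'abcdef');  // 1
$notEnd   = preg_match('/def$/', 'defx');    // 0

// Both together validate the whole subject.
$whole    = preg_match('/^abc$/', 'abc');    // 1
$partial  = preg_match('/^abc$/', 'abcd');   // 0

var_dump($start, $notStart, $end, $notEnd, $whole, $partial);
```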

Task: Validate string start

The ^ anchors the pattern to the string start.

Validate that the string starts with a digit.

Task: Validate string end

The $ anchors the pattern to the string end.

Validate that the string ends with a digit.

Task: Validate a German zip code

The modifier D makes sure that a linefeed at the end of the subject is not ignored. Validate that the subject is a German zip code. It consists of 5 digits.

Modifier and Alternatives

  • Modifier m - line anchors
  • \A - string start
  • \Z - string end, ignore linefeed
  • \z - string end, recognize linefeed
  • \b - word boundary
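
How these anchors differ shows up best on a subject with a trailing linefeed. A sketch with an invented two-line text:

```php
<?php
$text = "one\ntwo\n";

// m: ^ and $ match at every line start and line end.
$lineCount = preg_match_all('/^\w+$/m', $text, $lines);
// $lineCount === 2, $lines[0] === ['one', 'two']

// Plain $ also matches before a final linefeed; D switches that off.
$dollar  = preg_match('/two$/', $text);      // 1
$dollarD = preg_match('/two$/D', $text);     // 0

// \Z behaves like plain $, \z like $ with D.
$bigZ    = preg_match('/two\Z/', $text);     // 1
$smallZ  = preg_match('/two\z/', $text);     // 0

// \A always means string start, even with m.
$caretM  = preg_match('/^two/m', $text);     // 1
$bigA    = preg_match('/\Atwo/m', $text);    // 0

// \b matches between a word character and a non-word character.
$word    = preg_match('/\bcat\b/', 'a cat sat');    // 1
$inside  = preg_match('/\bcat\b/', 'concatenate');  // 0

var_dump($lines[0], $dollar, $dollarD, $bigZ, $smallZ, $caretM, $bigA, $word, $inside);
```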

Character Classes

  • Square Brackets: []
  • - for ranges
  • ^ for negative matches
  • many special characters lose function
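
A short sketch of ranges, negation and special characters inside a class. The subjects are invented and deliberately differ from the tasks.

```php
<?php
// A range: any character from a to c.
preg_match_all('/[a-c]/', 'abcdef', $inRange);
// $inRange[0] === ['a', 'b', 'c']

// A leading ^ negates the class.
preg_match_all('/[^a-c]/', 'abcdef', $notInRange);
// $notInRange[0] === ['d', 'e', 'f']

// Inside a class . + * are plain characters, no escaping needed.
preg_match('/[.+*]/', 'a+b', $special);
// $special[0] === '+'

print_r($inRange[0]);
print_r($notInRange[0]);
print_r($special);
```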

Task: Match vowels

Match all the vowels (aeiou) in the string.

Task: Match non-vowels

Match all the non-vowels in the string.

Task: Validate hexadecimal bytes

Validate that the string consists of two characters. Each character can be a digit or a letter between a and f.

Quantifier

How often will it be matched?

  • * - any count
  • ? - maximum of 1
  • + - minimum of 1
  • {n} - exactly n
  • {n,m} - minimum of n, maximum of m
  • {n,} - minimum of n
  • {0,m} - maximum of m
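
The quantifiers above, checked against a small table of invented patterns and subjects. Each row holds the pattern, the subject and the expected return value of preg_match().

```php
<?php
$cases = [
    ['/^ab*c$/',    'ac',      1], // * allows zero repetitions
    ['/^ab*c$/',    'abbbc',   1], // ... or many
    ['/^colou?r$/', 'color',   1], // ? allows zero
    ['/^colou?r$/', 'colouur', 0], // ... but at most one
    ['/^a+$/',      '',        0], // + needs at least one
    ['/^a+$/',      'aaa',     1],
    ['/^a{3}$/',    'aaa',     1], // exactly three
    ['/^a{3}$/',    'aa',      0],
    ['/^a{2,3}$/',  'aaa',     1], // two to three
    ['/^a{2,3}$/',  'aaaa',    0],
    ['/^a{2,}$/',   'a',       0], // two or more
    ['/^a{2,}$/',   'aaaaa',   1],
    ['/^a{0,2}$/',  '',        1], // up to two
    ['/^a{0,2}$/',  'aaa',     0],
];

$results = [];
foreach ($cases as [$pattern, $subject, $expected]) {
    $actual = preg_match($pattern, $subject);
    $results[] = ($actual === $expected);
    printf("%-14s %-9s %d\n", $pattern, "'$subject'", $actual);
}
```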

Task: Validate a German zip code

The {n} syntax allows you to match a fixed number of repetitions of a qualifier. Validate that the subject is a German zip code. It consists of 5 digits.

Task: Validate a language code

The {n,m} syntax lets you set a minimum and a maximum number of repetitions. Validate that the subject is a 2 or 3 letter language code.

Task: Validate an integer

? matches one or none. + matches at least one repetition. Validate an integer including an optional leading sign.

Unicode

The modifier u activates Unicode UTF-8 mode.

  • \X - extended unicode grapheme sequence
  • \p{xx}, \px - character with unicode property
  • \P{xx}, \p{^xx} - character without unicode property
  • \p{script} - character from script
  • \x{FFFF} - code point
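
A sketch of what the u modifier changes. Non-ASCII characters are written as "\u{...}" escapes (PHP 7+) so the example does not depend on the file encoding; the subjects are invented.

```php
<?php
$umlaut = "\u{00FC}"; // ü, two bytes in UTF-8

// Without u the dot matches bytes, with u it matches characters.
$bytes = preg_match_all('/./', $umlaut);       // 2
$chars = preg_match_all('/./u', $umlaut);      // 1

// \X matches a whole grapheme: e + combining acute accent is one unit.
$accented  = "e\u{0301}";
$graphemes = preg_match_all('/\X/u', $accented);   // 1
$points    = preg_match_all('/./u', $accented);    // 2

// \p{script}: Greek letters only.
preg_match_all('/\p{Greek}/u', "a\u{03B1}b\u{03B2}", $greek);
// $greek[0] === ["\u{03B1}", "\u{03B2}"]

// \P{xx} and \p{^xx} both negate a property (Lu = uppercase letter).
preg_match_all('/\P{Lu}/u', 'aBc', $notUpperA);
preg_match_all('/\p{^Lu}/u', 'aBc', $notUpperB);
// both === ['a', 'c']

// \x{FFFF}: a code point, here the euro sign U+20AC.
$euro = preg_match('/\x{20AC}/u', "5 \u{20AC}");   // 1

var_dump($bytes, $chars, $graphemes, $points, $euro);
```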

Task: Match unicode letters

Use the unicode property L to match any letter in the string "English, Русский, 中文".

Task: Match Cyrillic letters

Match any Cyrillic letter in the subject.

Groups

  • (...) - captured group
  • (?<group_name>...) - named group
  • (?:...) - group without capture
  • ((?i)...), (?i:...) - group modifiers
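
A sketch of how the group types show up in $matches. The address and subjects are invented.

```php
<?php
// One captured group, one named group, one group without capture.
$pattern = '/(\w+)@(?<host>\w+)\.(?:com|org)/';
preg_match($pattern, 'mail bob@example.org now', $m);

// $m[0]      === 'bob@example.org'  (full match)
// $m[1]      === 'bob'              (captured group)
// $m['host'] === 'example'          (named group ...)
// $m[2]      === 'example'          (... which also gets an index)
// (?:com|org) does not create an entry.
print_r($m);

// Group modifier: only "php" is case insensitive, " rocks" is not.
$scoped    = preg_match('/(?i:php) rocks/', 'PHP rocks');  // 1
$notScoped = preg_match('/(?i:php) rocks/', 'PHP ROCKS');  // 0
var_dump($scoped, $notScoped);
```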

Task: Match date parts

Match a date in the format "YYYY-MM-DD". Capture each part into a named group (year, month, day).

Task: Check for consecutive ughs

Check if the string contains 3 consecutive "ugh"s.

Alternatives

  • | - alternative patterns
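
The | has the lowest precedence in a pattern, so it usually needs a group. A small sketch with invented subjects:

```php
<?php
// Grouped: either "cat" or "dog", optionally followed by "s".
$grouped = preg_match('/^(?:cat|dog)s?$/', 'dogs');      // 1

// Ungrouped: this reads as "^cat" OR "dog$", so "catalog" matches.
$ungrouped = preg_match('/^cat|dog$/', 'catalog');       // 1

// With the group the anchors apply to both alternatives.
$anchored = preg_match('/^(?:cat|dog)$/', 'catalog');    // 0

var_dump($grouped, $ungrouped, $anchored);
```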

Task: Validate title and name

Match strings that start with a title ('Mr.', 'Ms.', 'Mrs.'), followed by a space and a string that contains at least one letter.

Format and comment

  • Modifier x allows formatting
  • # - single line comment
  • (?#...) - comment group
  • \Q...\E - remove special meaning
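
A sketch of the comment and quoting syntax with invented subjects. Note that a comment must not contain the pattern delimiter.

```php
<?php
// \Q...\E: everything in between is taken literally.
$quoted   = preg_match('/^\Q1+1=2\E$/', '1+1=2');   // 1
// Without it, + is a quantifier and the literal "+" does not match.
$unquoted = preg_match('/^1+1=2$/', '1+1=2');       // 0

// (?#...) is a comment group, it works without the x modifier.
$commentGroup = preg_match('/\d+(?# the amount) EUR/', '15 EUR');  // 1

// With x, whitespace is ignored and # starts a single line comment.
$formatted = preg_match('/
    \d+   # the amount
    [ ]   # a literal space, written as a character class
    EUR   # the currency
/x', '15 EUR');                                      // 1

var_dump($quoted, $unquoted, $commentGroup, $formatted);
```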

Example: Format and comment

$pattern = '(/
  (?:[a-zA-Z\\d_-]+\\.) #title
  (?<mode>media|download|thumb)\\. # mode
  (?:(?<preview>preview)\\.)? # is preview
  (?<media_uri>
    (?<id>[A-Fa-f\\d]{32}) #id
    (?:v(?<version>\\d+))? #version
    (?:\\.[a-zA-Z\\d]+)? #extension
  )
$)Dix';

Back references

  • \1, \g{1} - reference group by index
  • (?P=name), \g{name} - reference group by name
  • \g{-1} - relative group reference
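
A back reference matches the same text that the group matched before. A sketch of the three forms with invented subjects:

```php
<?php
// By index: a word character followed by the same character.
preg_match('/(\w)\1/', 'hello', $double);
// $double[0] === 'll'

// By name: the closing tag must repeat the opening tag name.
$tagPattern = '/<(?<tag>\w+)>.*<\/(?P=tag)>/';
$sameTag  = preg_match($tagPattern, '<b>bold</b>');   // 1
$otherTag = preg_match($tagPattern, '<b>bold</i>');   // 0

// \g{name} is an alternative spelling for the named reference.
$gName = preg_match('/<(?<tag>\w+)>.*<\/\g{tag}>/', '<b>bold</b>');  // 1

// Relative: \g{-1} refers to the group opened most recently before it.
preg_match('/(a|b)\g{-1}/', 'abba', $relative);
// $relative[0] === 'bb'

var_dump($double[0], $sameTag, $otherTag, $gName, $relative[0]);
```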

Task: Validate drunken numbers

Validate strings that consist of any number of repetitions of the same digit (11, 444, ...).

Templates

  • (?(DEFINE)(?<name>...))
  • (?&name)
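
A minimal sketch of a template, assuming an invented "hh:mm:ss"-like format. The group name "two" is made up for this example; the DEFINE block itself never matches anything, it only provides the sub-pattern.

```php
<?php
// (?&two) calls the sub-pattern defined in the DEFINE block.
$pattern = '/^(?&two):(?&two):(?&two)$(?(DEFINE)(?<two>\d{2}))/D';

$full     = preg_match($pattern, '12:34:56');  // 1
$tooShort = preg_match($pattern, '12:34');     // 0
$oneDigit = preg_match($pattern, '1:2:3');     // 0

var_dump($full, $tooShort, $oneDigit);
```

Unlike a back reference, (?&two) reuses the pattern, not the matched text, so the three parts may differ.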

Task: Validate IPv4

Define a template that matches a number between 0 and 255. Use the template to match an IP.

Pattern: IPv4

$pattern = '(^
  (?:(?&number)\\.){3}(?&number)
  (?(DEFINE)
    (?<number>
      25[0-5]| # 250 - 255
      2[0-4]\\d| # 200 - 249
      1?\\d{1,2} # 0 - 199
    )
  )
$)Dx';

Assertions

  • (?=...), (?!...) - Lookahead
  • (?<=...), (?<!...) - Lookbehind
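
Assertions check the text before or after the current position without consuming it. A sketch of all four forms with invented subjects:

```php
<?php
$prices = '10 EUR, 20 USD, 30 EUR';

// Positive lookahead: numbers followed by " EUR".
preg_match_all('/\d+(?= EUR)/', $prices, $eur);
// $eur[0] === ['10', '30']

// Negative lookahead: whole numbers NOT followed by " EUR".
preg_match_all('/\b\d+\b(?! EUR)/', $prices, $notEur);
// $notEur[0] === ['20']

$costs = 'cost $15 or 20';

// Positive lookbehind: numbers preceded by a dollar sign.
preg_match_all('/(?<=\$)\d+/', $costs, $dollar);
// $dollar[0] === ['15']

// Negative lookbehind: numbers NOT preceded by a dollar sign.
preg_match_all('/(?<!\$)\b\d+/', $costs, $plain);
// $plain[0] === ['20']

print_r([$eur[0], $notEur[0], $dollar[0], $plain[0]]);
```

The \b in the negative forms keeps the engine from backtracking into a shorter number that would satisfy the assertion.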

Links