Finding names in raw text

Sometimes it is difficult to find names in a text. Maybe the most naïve way is to take every word that starts with a capital letter, and that's it! But if you check this paragraph, you will find "names" like «Maybe» or «But» (???). So, fortunately, there are smarter ideas, like this one, which uses a regex with a few particular rules:

  • A name is made up of at least two words, each starting with a capital letter.
  • It may contain more than two words, like «James Van de Putte» or something similar.
  • The words are separated by whitespace.
  • … and so on.
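For contrast, here is a minimal sketch of the naïve approach from the opening paragraph, to show why rules like these are needed. The function name `capitalized_words` and the sample sentence are made up for illustration; they are not part of the original post.

```python
import re


def capitalized_words(text):
    """Naive approach: return every word that starts with a capital letter."""
    # One uppercase ASCII letter followed by one or more lowercase letters.
    return re.findall(r"\b[A-Z][a-z]+\b", text)


# Hypothetical sample sentence, chosen to show the false positives.
sample = "Maybe James is late. But Anna is here."
print(capitalized_words(sample))
# → ['Maybe', 'James', 'But', 'Anna']
```

The real names come back, but so do «Maybe» and «But», simply because they start a sentence.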

This is the final regex used to extract names (specifically, multi-word names) from a text.

[A-Z]([a-z]+|\.)(?:\s+[A-Z]([a-z]+|\.))*(?:\s+[a-z][a-z\-]+){0,2}\s+[A-Z]([a-z]+|\.)
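As a rough sketch, the pattern above could be used from Python like this. The names `NAME_PATTERN` and `find_names`, and the sample sentence, are assumptions made for this example; only the regex itself comes from the post.

```python
import re

# The regex from above, split over two raw strings only for readability.
NAME_PATTERN = re.compile(
    r"[A-Z]([a-z]+|\.)(?:\s+[A-Z]([a-z]+|\.))*"
    r"(?:\s+[a-z][a-z\-]+){0,2}\s+[A-Z]([a-z]+|\.)"
)


def find_names(text):
    """Return every substring of `text` matched by the name pattern."""
    # The pattern contains capturing groups, so re.findall would return
    # tuples of groups; finditer + group(0) yields the full match instead.
    return [match.group(0) for match in NAME_PATTERN.finditer(text)]


# Hypothetical sample sentence.
sample = "we met James Van de Putte and J. R. Tolkien yesterday."
print(find_names(sample))
# → ['James Van de Putte', 'J. R. Tolkien']
```

Two limitations are worth keeping in mind. A capitalized word at the start of a sentence that is directly followed by a name gets absorbed into the match (for example, «Yesterday James Smith»), and the character classes only cover ASCII letters, so names with accented characters are not matched as written.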

 
