Tuesday, March 3, 2015

How to extract dates from (relatively) unstructured text [R]

I'm having difficulty extracting dates from a string. The string can take one of several forms, but it will always include some variation of:



<full month name> <day of month>, <year>


As in:



DECEMBER 4, 2011


However, the text at the beginning of the string ranges widely, taking forms like all of these:



THE PUBLIC SCHEDULE FOR MAYOR RAHM EMANUEL JUNE 9, 2011
THE PUBLIC SCHEDULE FOR MAYOR RAHM EMANUEL FOR OCTOBER 29 & OCTOBER 30, 2011
The Public Schedule for Mayor Rahm Emanuel December 17, 2011 through January 2, 2012
The Public Schedule for Mayor Rahm Emanuel December 8th and 9th, 2012
The Public Schedule for Mayor Rahm Emanuel – March 13, 2013


These variations are really throwing me off. Ordinarily, I would just drop the first X characters of the string and use the remainder as my date, but because the text before the date varies in length and wording, a fixed offset isn't possible. I have tried variations of this approach, but the resulting dates have just as many problems.


It seems like grep() might be the function to use here, but I don't really understand how to write a pattern that would capture these dates, or how to use its output.
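For what it's worth, grep() only reports which elements match a pattern; pulling the matched text out is usually done with gregexpr() and regmatches(). Below is a minimal sketch of that approach in base R, not a definitive solution. The names date_pattern and extract_dates are hypothetical, and the sketch assumes English month names, a four-digit year, and that no stray comma sits between the day and the year. When several days share one year ("OCTOBER 29 & OCTOBER 30, 2011", "December 8th and 9th, 2012"), it returns only the first day.

```r
# Alternation of the twelve English month names from base R's month.name.
months_alt <- paste(month.name, collapse = "|")

# Month, day, then anything up to the next comma (ordinal suffixes,
# "& OCTOBER 30", "and 9th"), then the comma and a four-digit year.
date_pattern <- paste0("\\b(", months_alt, ")\\s+([0-9]{1,2})[^,]*,\\s*([0-9]{4})")

# Hypothetical helper: takes one string, returns a Date vector with one
# element per match (length 0 if nothing matches).
extract_dates <- function(x) {
  stopifnot(is.character(x), length(x) == 1)
  # gregexpr() locates every match; regmatches() extracts the matched text.
  hits <- regmatches(x, gregexpr(date_pattern, x, ignore.case = TRUE, perl = TRUE))[[1]]
  # Each hit is exactly one match, so sub() can reduce it to a capture group.
  month <- sub(date_pattern, "\\1", hits, ignore.case = TRUE, perl = TRUE)
  day   <- sub(date_pattern, "\\2", hits, ignore.case = TRUE, perl = TRUE)
  year  <- sub(date_pattern, "\\3", hits, ignore.case = TRUE, perl = TRUE)
  # Look the month up in month.name rather than relying on the locale's "%B".
  month_num <- match(toupper(month), toupper(month.name))
  as.Date(sprintf("%s-%02d-%02d", year, month_num, as.integer(day)),
          format = "%Y-%m-%d")
}

extract_dates("THE PUBLIC SCHEDULE FOR MAYOR RAHM EMANUEL JUNE 9, 2011")
# "2011-06-09"
extract_dates("The Public Schedule for Mayor Rahm Emanuel December 17, 2011 through January 2, 2012")
# "2011-12-17" "2012-01-02"
```

For a whole character vector of titles, lapply(titles, extract_dates) gives a list with one Date vector per title. The word boundary plus the required whitespace and digits keep the "MAY" in "MAYOR" from being read as a month.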


Thank you for any help!

