Friday, April 3, 2015

preg_match_all cuts matches short at accented characters

My goal is to collect every hashtag in a tweet-like string such as:



$string = "i like to #studyéléctricité in french";
preg_match_all('/#(\w+)/',$string,$hashtags);


This correctly captures hashtags without accents and puts them in the array $hashtags.


But with my string, it captures only part of the expected match, cutting it off at the first accented character it encounters:



var_dump(mb_detect_encoding($string));
var_dump($hashtags[0]);


It returned:



string 'UTF-8' (length=5)


array (size=1) 0 => string '#study' (length=6)
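
One possible explanation, sketched under the assumption that the string really is UTF-8: without the u modifier, PCRE matches byte by byte, and "é" is stored as the two bytes C3 A9, neither of which counts as a word character in the C locale. The variable names below are illustrative, and the locale is pinned to "C" only to make the behaviour reproducible.

```php
<?php
// Assumption: the string is UTF-8; "\xC3\xA9" spells out the two bytes of "é".
$string = "i like to #study\xC3\xA9l\xC3\xA9ctricit\xC3\xA9 in french";

// Pin the character-type locale so \w means [A-Za-z0-9_] in byte mode.
setlocale(LC_CTYPE, 'C');

// "é" occupies two bytes in UTF-8.
$eAcuteHex = bin2hex("\xC3\xA9");

// Without /u, \w+ is matched byte by byte and stops at the first 0xC3 byte.
preg_match_all('/#(\w+)/', $string, $hashtags);
$byteModeMatch = $hashtags[0][0];

var_dump($eAcuteHex, $byteModeMatch);
```

Here $byteModeMatch ends at "#study", which matches the truncated result shown above.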



Solutions tested:


1) The string is in UTF-8, so I tried Unicode-aware regexes:



preg_match_all('/#(\w+)/u', $string, $hashtags);
preg_match_all('/#(pL+)/u', $string, $hashtags);
preg_match_all('/#(p{L}+)/u', $string, $hashtags);
preg_match_all('/#(\pL+)/u', $string, $hashtags);
preg_match_all('/#(\p{L}+)/u', $string, $hashtags);


These all returned empty arrays.
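
When a pattern with the u modifier yields empty arrays, one thing worth checking (a hedged diagnostic, not a confirmed cause here) is whether preg_match_all actually returned false because PCRE rejected the subject as invalid UTF-8. mb_detect_encoding only guesses at the encoding, so the bytes may not be what it reports, for instance if the source file was saved in a Latin encoding. The helper name diagnoseUtf8Match is hypothetical.

```php
<?php
// Hypothetical helper: reports why a /u match might come back empty.
function diagnoseUtf8Match(string $pattern, string $subject): array
{
    // An empty pattern with /u matches any valid UTF-8 string, so a result
    // other than 1 means PCRE rejected the bytes.
    $validUtf8 = preg_match('//u', $subject) === 1;

    // preg_match_all returns false (not 0) when PCRE reports an error.
    $count = preg_match_all($pattern, $subject, $matches);
    $badUtf8 = preg_last_error() === PREG_BAD_UTF8_ERROR;

    return [
        'valid_utf8' => $validUtf8,
        'count'      => $count,
        'bad_utf8'   => $badUtf8,
        'matches'    => $matches[0] ?? [],
    ];
}

// "\xE9" is "é" in ISO-8859-1/15, which is not a valid UTF-8 sequence.
$latinBytes = "i like to #study\xE9l\xE9ctricit\xE9 in french";
$report = diagnoseUtf8Match('/#(\w+)/u', $latinBytes);

var_dump($report);
```

If 'bad_utf8' is true for the real input, the problem lies in the bytes of the string rather than in the pattern.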


2) I tried converting the encoding to ISO-8859-15:



$string = mb_convert_encoding($string, 'ISO-8859-15', 'UTF-8');
preg_match_all('/#(\w+)/',$string,$hashtags);
var_dump(mb_detect_encoding($string));
var_dump($hashtags[0]);


It returned:



string 'ASCII' (length=5)


array (size=1) 0 => string '#studylctricit' (length=14)



3) I also tried iconv:



$string = iconv($string, 'UTF-8', 'ISO-8859-15');
preg_match_all('/#(\w+)/',$string,$hashtags);
var_dump(mb_detect_encoding($string));
var_dump($hashtags[0]);


It returned:



string 'ASCII' (length=5)


array (size=1) 0 => string '#study' (length=6)
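
A side note on argument order: iconv() takes the source encoding first, then the target encoding, then the string. The sketch below uses that order, assuming the input is UTF-8 and that the iconv extension supports ISO-8859-15; the variable names are illustrative.

```php
<?php
// Assumption: the input is UTF-8; "\xC3\xA9" is "é" written as bytes.
$utf8 = "i like to #study\xC3\xA9l\xC3\xA9ctricit\xC3\xA9 in french";

// iconv() takes the source encoding, then the target encoding, then the string.
$latin9 = iconv('UTF-8', 'ISO-8859-15', $utf8);

// In ISO-8859-15 each "é" is a single byte, so the string shrinks by 3 bytes.
$lengths = [strlen($utf8), strlen($latin9)];

// Byte offset 16 is the first accented character, right after "#study".
$firstAccentHex = bin2hex($latin9[16]);

var_dump($lengths, $firstAccentHex);
```

Even after a successful conversion, a pattern without the u modifier still needs a locale in which byte E9 counts as a letter, so the conversion alone may not change the result.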



How can I collect hashtags that contain accented characters in this situation?


Thank you in advance for any help or advice you can provide!


Jeff
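
For comparison, here is a minimal sketch of a pattern that would be expected to capture the full hashtag, assuming the subject is valid UTF-8 and PCRE was built with Unicode property support. The character class and the second hashtag "#php" are illustrative choices, not something taken from the post.

```php
<?php
// Assumption: the subject is valid UTF-8 ("\xC3\xA9" is "é" written as bytes).
$string = "i like to #study\xC3\xA9l\xC3\xA9ctricit\xC3\xA9 in french #php";

// \p{L} = any Unicode letter, \p{N} = any Unicode number; /u enables UTF-8 mode.
$count = preg_match_all('/#([\p{L}\p{N}_]+)/u', $string, $hashtags);

if ($count === false) {
    // Illustrative error handling: surface the PCRE error instead of an empty array.
    throw new RuntimeException('preg_match_all failed, error code ' . preg_last_error());
}

$tags = $hashtags[1]; // captured names without the leading "#"

var_dump($hashtags[0], $tags);
```

Checking the return value against false distinguishes "no hashtags found" from "PCRE rejected the input", which an empty array alone cannot tell apart.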

