preg_match_all cuts matches off at accented characters

Friday, April 3, 2015

My goal is to collect every hashtag in a tweet-like string such as:


$string = "i like to #studyéléctricité in french";
preg_match_all('/#(\w+)/',$string,$hashtags);

This correctly captures hashtags without accents and puts them in the array $hashtags.

But with the string above, it captures only part of the expected match, cutting it off at the first accented character it encounters:


var_dump(mb_detect_encoding($string));
var_dump($hashtags[0]);

It returned:

string 'UTF-8' (length=5)

array (size=1)
  0 => string '#study' (length=6)

Solutions I tried:

1) The string is in UTF-8, so I tried Unicode-aware regexes:


preg_match_all('/#(\w+)/u',    $string, $hashtags);
preg_match_all('/#(pL+)/u',    $string, $hashtags);
preg_match_all('/#(p{L}+)/u',  $string, $hashtags);
preg_match_all('/#(\pL+)/u',   $string, $hashtags);
preg_match_all('/#(\p{L}+)/u', $string, $hashtags);

These all returned empty arrays.
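One possible explanation for the empty arrays, offered as a sketch rather than a confirmed diagnosis: with the /u modifier, PCRE refuses to match against a subject that is not valid UTF-8, and preg_match_all returns false instead of a match count. The snippet below reproduces that failure mode on purpose. The "\xE9" byte is an assumption chosen for illustration (an "é" encoded as a single ISO-8859-1 byte), not data from the original string.

```php
<?php
// Assumption: "\xE9" is "é" as one ISO-8859-1 byte, which is invalid as UTF-8.
$latin1 = "#study\xE9";

// With /u, the subject must be valid UTF-8, otherwise the call fails.
$result = preg_match_all('/#(\w+)/u', $latin1, $hashtags);

var_dump($result);                                   // false signals a failure, not "no match"
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // true when the subject is not valid UTF-8
```

If the same check reports PREG_BAD_UTF8_ERROR for the real string, the bytes reaching preg_match_all are probably not UTF-8, whatever mb_detect_encoding suggests.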

2) I tried converting the string to ISO-8859-15:


$string = mb_convert_encoding($string, 'ISO-8859-15', 'UTF-8');
preg_match_all('/#(\w+)/',$string,$hashtags);
var_dump(mb_detect_encoding($string));
var_dump($hashtags[0]);

It returned:

string 'ASCII' (length=5)

array (size=1)
  0 => string '#studylctricit' (length=14)

3) I also tried iconv:


$string = iconv($string, 'UTF-8', 'ISO-8859-15');
preg_match_all('/#(\w+)/',$string,$hashtags);
var_dump(mb_detect_encoding($string));
var_dump($hashtags[0]);

It returned:

string 'ASCII' (length=5)

array (size=1)
  0 => string '#study' (length=6)

How can I capture hashtags that contain accented characters in this situation?

Thank you in advance for any help or advice you can provide!

Jeff
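For reference, here is a minimal sketch of the behaviour being asked for, assuming the input really is valid UTF-8 and that the mbstring extension is available. The function name extractHashtags is hypothetical and the character class is one possible definition of a hashtag, not an official one. It validates the encoding first, so a bad input fails loudly instead of producing an empty result.

```php
<?php
// Hypothetical helper, not part of the original code.
function extractHashtags(string $text): array
{
    // Assumption: the caller intends $text to be UTF-8. Check it explicitly,
    // because the /u modifier makes PCRE reject invalid UTF-8 subjects.
    if (!mb_check_encoding($text, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    // \p{L} matches any Unicode letter, \p{N} any Unicode number; underscore is allowed too.
    $count = preg_match_all('/#([\p{L}\p{N}_]+)/u', $text, $matches);

    if ($count === false) {
        throw new RuntimeException('Regex failed with error code ' . preg_last_error());
    }

    // $matches[0] holds the full matches, including the leading "#".
    return $matches[0];
}

// This source file must itself be saved as UTF-8 for the literal below to be valid.
$string = "i like to #studyéléctricité in french";
var_dump(extractHashtags($string));
```

If this works in isolation but not in the real script, the difference is likely in how the string gets there (file encoding, database connection charset, or form input), which is worth checking before changing the regex.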
