I've been working on a project that requires me to create a simple search form using PHP and PostgreSQL, importing the data from XML files. The tricky part is that the XML files must contain Latin, Cyrillic, Korean and Japanese characters. I figured that if I just used UTF-8 encoding for both the XML/HTML pages and the database, everything should work fine. Even though the non-Latin characters appear all screwed up when I view them directly in the database, they look just fine when I display them on the page.
The problem comes with the searching. When I search for an English title, or anything using Latin characters, it works just fine, but when I enter a Cyrillic/Japanese/Korean search string, I get no results whatsoever. Any idea why that is happening and how I can fix it?
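For reference, here is a minimal sketch of how the search side could be set up so that non-Latin input reaches PostgreSQL as UTF-8. The table name `titles`, the column `title`, and all three function names are hypothetical, and it assumes PDO with the pgsql driver plus the mbstring extension. Whether `ILIKE` folds case for Cyrillic text also depends on the database locale, so treat this as a starting point rather than a fix.

```php
<?php
// Sketch only: "titles" and "title" are hypothetical names, not from the project.

/**
 * Trim the raw input and return it only if it is valid UTF-8.
 */
function normalizeSearchTerm(string $raw): ?string
{
    $term = trim($raw);
    if ($term === '' || !mb_check_encoding($term, 'UTF-8')) {
        return null; // empty or not UTF-8: do not send it to the database
    }
    return $term;
}

/**
 * Escape LIKE wildcards so the user's text is matched literally,
 * then wrap it in % for a substring match.
 */
function buildLikePattern(string $term): string
{
    return '%' . addcslashes($term, '\\%_') . '%';
}

/**
 * Run the search with a bound parameter (assumes a PDO pgsql connection).
 */
function searchTitles(PDO $pdo, string $raw): array
{
    $term = normalizeSearchTerm($raw);
    if ($term === null) {
        return [];
    }
    // Ask the server to treat this connection's text as UTF-8.
    $pdo->exec("SET client_encoding TO 'UTF8'");
    $stmt = $pdo->prepare('SELECT title FROM titles WHERE title ILIKE :pattern');
    $stmt->execute([':pattern' => buildLikePattern($term)]);
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}
```

Calling `searchTitles($pdo, $_GET['q'] ?? '')` would be the typical use, where `q` is again just an assumed form field name.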
Edit: It appears this isn't over yet! Things got even weirder. Now it works with Japanese/Korean as well, but for Cyrillic it works only if I copy the word directly from the XML file; if I type it on the keyboard, I get no results. This doesn't make any sense to me, and I'm even more confused now.
Any idea how to solve that?
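P.S. Here's the title copied from the XML: E&#1085;&#1076;&#1080;&#1074;a&#1083; ho&#1084;&#1098;&#1090; &#1085;a &#1061;ao&#1089;a
Here's the same title typed using the keyboard: &#1045;&#1085;&#1076;&#1080;&#1074;&#1072;&#1083; h&#1086;&#1084;&#1098;&#1090; &#1085;&#1072; &#1061;&#1072;&#1086;&#1089;&#1072;
I get the same result when echoing them: copied one - E&#1085;&#1076;&#1080;&#1074;a&#1083; ho&#1084;&#1098;&#1090; &#1085;a &#1061;ao&#1089;a, typed one - &#1045;&#1085;&#1076;&#1080;&#1074;&#1072;&#1083; h&#1086;&#1084;&#1098;&#1090; &#1085;&#1072; &#1061;&#1072;&#1086;&#1089;&#1072;.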
I tried comparing them online using http://www.textdiff.com/, and the result is that they are 100% different... I'm not really sure why that is. Even if they are using different encodings (like windows-1251 and UTF-8), I convert both of them to UTF-8 before searching, so there shouldn't be a problem with that.
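One way to see what actually differs is to dump the Unicode code points of both strings instead of eyeballing them. In the two titles pasted above, the copied one seems to contain plain Latin letters (E, a, o) at positions where the typed one does not, and a Latin "a" (U+0061) looks the same on screen as a Cyrillic "а" (U+0430). The sketch below only illustrates that check; the function names are made up, and it assumes PHP 7.4+ with the mbstring extension.

```php
<?php
/**
 * List the code points of a UTF-8 string as "U+XXXX" labels.
 */
function codePoints(string $text): array
{
    $labels = [];
    foreach (mb_str_split($text, 1, 'UTF-8') as $char) {
        $labels[] = sprintf('U+%04X', mb_ord($char, 'UTF-8'));
    }
    return $labels;
}

/**
 * Compare two strings code point by code point.
 * Returns [position => [codePointInA, codePointInB]] for every mismatch.
 */
function differingPositions(string $a, string $b): array
{
    $pointsA = codePoints($a);
    $pointsB = codePoints($b);
    $diff = [];
    $length = max(count($pointsA), count($pointsB));
    for ($i = 0; $i < $length; $i++) {
        $ca = $pointsA[$i] ?? '(none)';
        $cb = $pointsB[$i] ?? '(none)';
        if ($ca !== $cb) {
            $diff[$i] = [$ca, $cb];
        }
    }
    return $diff;
}
```

For example, `differingPositions("\u{043D}a", "\u{043D}\u{0430}")` reports a mismatch at position 1 only, with U+0061 on one side and U+0430 on the other, even though both strings display as the same two letters.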
I'm really at a loss here.
SOLVED: In the end it was just bad luck, I guess. When trying Cyrillic I was always searching for the first entry and didn't try the others because I figured they wouldn't work either. It turned out they did, and the first one was the only one that wasn't working. Thinking back, I typed all of the others by hand, but for this one I was lazy and just copied the title from another site, which was apparently using a different encoding. I was converting it to UTF-8 anyway, but I guess that didn't work properly. Doesn't matter now - I just updated the XML entry and typed it myself, and everything is OK now.
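If the copied title really did mix look-alike Latin and Cyrillic letters (an assumption based on the strings above, not something I confirmed), an encoding conversion would not change it, because both kinds of letters are valid UTF-8. A check like the following at import time could flag such entries before they reach the database. The function name is hypothetical, and it relies on PCRE's Unicode script properties with the `u` modifier.

```php
<?php
/**
 * Return the words in $text that contain both Latin and Cyrillic letters.
 * Such words are often the result of pasting text with look-alike characters.
 */
function mixedScriptWords(string $text): array
{
    // Split on whitespace; fall back to an empty list on invalid UTF-8.
    $words = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];

    $mixed = array_filter($words, function (string $word): bool {
        return preg_match('/\p{Latin}/u', $word) === 1
            && preg_match('/\p{Cyrillic}/u', $word) === 1;
    });

    return array_values($mixed);
}
```

Running it over each title read from the XML and logging any non-empty result would point at the entries worth retyping. Titles that legitimately mix scripts inside one word would be flagged too, so it is a hint and not a verdict.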