
Author

PostgreSQL Non-Latin Characters

BrandonHeat
Member



Posts: 12
Location:
Joined: 26.02.11
Rank:
Guest
Posted on 23-05-11 20:40
I've been working on a project that requires me to create a simple search form using PHP and PostgreSQL, importing the data from XML files. The tricky part is that the XML files must contain Latin, Cyrillic, Korean and Japanese characters. I figured that if I just used UTF-8 encoding for both the XML/HTML pages and the database, everything should work fine. Even though the non-Latin characters appear all screwed up when I view them directly in the database, they look just fine when I display them on the page.

The problem comes with the searching. When I search for an English title, or anything using Latin characters, it works just fine, but when I enter a Cyrillic/Japanese/Korean search string, I get no results whatsoever. Any idea why that is happening and how I can fix it?
Author

RE: PostgreSQL Non-Latin Characters

spyware
Member



Posts: 4192
Location: The Netherlands
Joined: 14.04.07
Rank:
God
Warn Level: 90
Posted on 23-05-11 20:45
You probably need to enforce UTF-8 encoding on the page where you search, and/or convert the string in PHP to something usable.
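To illustrate the "convert the string" idea, here is a minimal sketch in Python rather than PHP (in PHP, `mb_convert_encoding` plays a similar role). The function name `to_utf8_text` and the windows-1251 fallback are assumptions made for this example; nothing in the thread says which encoding the search string actually arrived in.

```python
def to_utf8_text(raw: bytes, fallback: str = "cp1251") -> str:
    """Decode incoming bytes as UTF-8, falling back to a legacy code page."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: input that is not valid UTF-8 is in a Cyrillic
        # Windows code page. A real form might need a different fallback.
        return raw.decode(fallback)


# The same two-letter Cyrillic word, submitted in two different encodings.
query_utf8 = "\u043d\u0430".encode("utf-8")
query_cp1251 = "\u043d\u0430".encode("cp1251")

# Both should normalize to the same text before it is used in a search.
print(to_utf8_text(query_utf8) == to_utf8_text(query_cp1251))
```

The point of normalizing first is that the database comparison then sees one consistent representation, whatever the browser or source file sent.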



"The chowner of property." - Zeph
Widespread intellectual and moral docility may be convenient for leaders in the short term,
but it is suicidal for nations in the long term.
- Carl Sagan
“Since the grid is inescapable, what were the earlier lasers about? Does the corridor have a sense of humor?” - Ebert
http://bitsofspy.net
Author

RE: PostgreSQL Non-Latin Characters

BrandonHeat
Member



Posts: 12
Location:
Joined: 26.02.11
Rank:
Guest
Posted on 23-05-11 20:53
I had already enforced UTF-8 on the search page and thought it should be enough, but converting the string in PHP actually did the trick. Thanks.

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Edit: It appears this isn't over yet! Things just got even weirder. It now works with Japanese/Korean as well, but for Cyrillic it only works if I copy the word directly from the XML file; if I type it on the keyboard I get no results. This doesn't make any sense to me, and I'm even more confused now.
Any idea how to solve that?

P.S Here's the title copied from the XML: El5;k6;l0;k4;al3; hol4;m8;m0; l5;a j1;aol9;a
Here's the same title typed using the keyboard: h5;l5;k6;l0;k4;k2;l3; hl6;l4;m8;m0; l5;k2; j1;k2;l6;l9;k2;

Edited by BrandonHeat on 23-05-11 23:16
Author

RE: PostgreSQL Non-Latin Characters

spyware
Member



Posts: 4192
Location: The Netherlands
Joined: 14.04.07
Rank:
God
Warn Level: 90
Posted on 23-05-11 23:21
It might be the case that the XML editor you're using sanitizes the data when it displays/saves it, i.e. while those two sentences look the same, the data you copy is different.

Try echoing the copied and typed string in PHP, and see how they differ (paste back the results here if you like).
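Echoing two strings that look alike will often print the same thing on screen, so it can help to dump the individual code points instead. The following is a rough Python sketch of that comparison (the helper names `describe` and `first_difference` are made up for this example, and the two sample characters are illustrative, not taken from the actual title):

```python
import unicodedata


def describe(text: str) -> list:
    """Return one 'U+XXXX NAME' entry per character of text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}" for ch in text]


def first_difference(a: str, b: str) -> int:
    """Index of the first differing character, or -1 if the strings are equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # No mismatch in the shared prefix: equal, or one string is longer.
    return -1 if len(a) == len(b) else min(len(a), len(b))


copied = "\u0061"  # looks like "a": LATIN SMALL LETTER A
typed = "\u0430"   # looks like "a": CYRILLIC SMALL LETTER A

print(describe(copied))
print(describe(typed))
print(first_difference(copied, typed))
```

If the two dumps differ even though the strings render identically, the search mismatch is in the data itself rather than in the query.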



Author

RE: PostgreSQL Non-Latin Characters

BrandonHeat
Member



Posts: 12
Location:
Joined: 26.02.11
Rank:
Guest
Posted on 24-05-11 11:21
I get the same result when echoing them: copied one - El5;k6;l0;k4;al3; hol4;m8;m0; l5;a j1;aol9;a, typed one - h5;l5;k6;l0;k4;k2;l3; hl6;l4;m8;m0; l5;k2; j1;k2;l6;l9;k2;.
I tried comparing them online using http://www.textdiff.com/, and the result is that they are 100% different... I'm not really sure why that is. Even if they are using different encodings (like windows-1251 and UTF-8), I convert both of them to UTF-8 before searching, so there shouldn't be a problem with that.
I'm really at a loss here.

SOLVED: In the end it was just bad luck, I guess. When trying Cyrillic I was always searching for the first entry and didn't try the others because I figured they wouldn't work either. It turned out they did, and the first one was the only one that wasn't working. Thinking back, I realized I had typed all of the others by hand, but for this one I was lazy and just copied the title from another site, which was obviously using a different encoding. I was converting it to UTF-8 anyway, but I guess it didn't work properly. Doesn't matter now: I just updated the XML entry, typing it myself, and everything is OK now.
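One possible cause of this kind of mismatch, separate from byte encoding, is a title that mixes look-alike Latin and Cyrillic letters; the copied title quoted earlier in the thread appears to contain plain Latin letters (E, a, o) among the Cyrillic ones. A rough Python check for that on import might look like the sketch below. The function names are hypothetical, and relying on the first word of the Unicode character name is a simplification that is only meant to cover Latin and Cyrillic letters.

```python
import unicodedata


def scripts_used(text: str) -> set:
    """Collect the leading word of each letter's Unicode name, e.g. 'CYRILLIC'."""
    found = set()
    for ch in text:
        if ch.isalpha():
            # For Latin and Cyrillic letters the name starts with the script.
            found.add(unicodedata.name(ch, "UNKNOWN").split(" ")[0])
    return found


def is_mixed_latin_cyrillic(text: str) -> bool:
    """True if the text contains both Latin and Cyrillic letters."""
    return {"LATIN", "CYRILLIC"} <= scripts_used(text)


all_cyrillic = "\u043d\u0430"   # Cyrillic en + Cyrillic a
with_latin_a = "\u043d\u0061"   # Cyrillic en + Latin a, renders the same

print(is_mixed_latin_cyrillic(all_cyrillic))
print(is_mixed_latin_cyrillic(with_latin_a))
```

Flagging such entries when the XML is loaded would point at the odd title directly, instead of leaving it to be found by failed searches.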

Edited by BrandonHeat on 27-05-11 10:26