Ticket #431 (testing task)

Opened 7 years ago

Last modified 13 months ago

Using UTF-8 to store data in the DB

Reported by: DarTar
Owned by: fishy
Priority: highest
Milestone: 1.4
Component: localization
Version: 1.1.6.2
Severity: blocker
Keywords: i18n l10n unicode multilingual
Cc: KrzysztofTrybowski

Description (last modified by DotMG)

As part of the work on i18n/l10n we should consider using UTF-8 as the charset/collation for the MySQL tables used by Wikka (where MySQL >= 4.1). This requires setting the charset of all tables to UTF-8 and issuing SET NAMES utf8 on each database connection. Upgrade routines will also be needed for users currently running a Wikka version with data encoded in a different charset.
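A minimal sketch of the SET NAMES step, assuming the old mysql_* API and Wikka-style config keys (illustrative only, not the attached patch):

// Minimal sketch, not the attached 0001 patch: declare the connection charset
// right after connecting so MySQL stores and returns UTF-8 bytes unchanged.
$link = mysql_connect($config['mysql_host'], $config['mysql_user'], $config['mysql_password']);
mysql_select_db($config['mysql_database'], $link);
mysql_query("SET NAMES 'utf8'", $link);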

Related tickets

  • #405 coordination ticket
  • #340 string translation
  • #137 entities conversion in the edit screen
  • #191 Making CamelCase optional
  • #966 Unlimited Article-Names
  • #1008 Review of installer/upgrader in 1.3
  • #1146 Serve UTF-8 charset in HTTP header

Attachments

0001-set-names-before-mysql-queries.patch Download (1.7 KB) - added by fishy 5 years ago.
patch to issue the SET NAMES SQL query after connect
0002-use-utf-8-as-charset.patch Download (0.7 KB) - added by fishy 5 years ago.
use utf-8 instead of iso-8859-1 for template
0003-initial-convert.php-for-body-only.patch Download (1.3 KB) - added by fishy 5 years ago.
initial sql to alter the charset of database/tables/fields
convert.php Download (1.3 KB) - added by fishy 5 years ago.
a modified version of convert.php for case 2
newconvert.php Download (1.9 KB) - added by DarTar 4 years ago.
new version of convert script by fishy
newconvert2.php Download (2.4 KB) - added by BrianKoontz 4 years ago.
Updated version of newconvert.php, with various fixes
create_table.patch Download (3.2 KB) - added by fishy 4 years ago.
patch for create table statements to add utf8 charset
newconvert3.php Download (9.6 KB) - added by fishy 4 years ago.
newconvert filled out with all the fields that need to be converted
fixDoubleUTF-8encodedStrings.php Download (1.6 KB) - added by KrzysztofTrybowski 4 years ago.
Script converts pages, notes and comments that were encoded as UTF-8 twice into correct UTF-8, fixing the double-unicoded issue. ATTN: if run on a correct UTF-8 database, it will destroy it!
fix_double_utf8-encoded_strings_v0_2.php Download (14.6 KB) - added by KrzysztofTrybowski 3 years ago.

Change History

  Changed 5 years ago by DarTar

  • owner changed from unassigned to fishy
  • status changed from new to assigned
  • milestone changed from 2.0 to 1.3

fishy, I'm assigning this to you as a reference for work on Unicode support. Could you please upload your patches as attachments to this ticket?

Changed 5 years ago by fishy

patch to issue the SET NAMES SQL query after connect

Changed 5 years ago by fishy

use utf-8 instead of iso-8859-1 for template

Changed 5 years ago by fishy

initial sql to alter the charset of database/tables/fields

  Changed 5 years ago by fishy

Let's take it case by case:

1. Original install, db & template untouched. This is the most common case and can be fixed by the three patches (a sketch of the database-side conversion follows after this list).

2. DB untouched, but template charset changed to UTF-8 (that is, patch 0002 applied). This is the most likely change made by users who need UTF-8. I'll provide a fix for this case later.

3. DB untouched, but template charset changed to something other than ISO-8859-1 or UTF-8 (GBK, Big5, etc.). There are so many charsets that we can't handle them one by one, so we can probably only provide instructions and ask users to fix it themselves.

4. Someone may also have altered/converted the db themselves. This case is complicated, and I suggest we only give some instructions and ask those users to fix it manually.
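For case 1, a hedged sketch of what the database-side conversion amounts to, assuming the old mysql_* API; the database and table names are illustrative, and the actual statements for all tables/fields are in the attached 0003 patch:

// Hedged sketch for case 1 only; not the attached patch.
mysql_query("ALTER DATABASE wikka DEFAULT CHARACTER SET utf8");
mysql_query("ALTER TABLE wikka_pages CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci");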

Changed 5 years ago by fishy

a modified version of convert.php for case 2

  Changed 5 years ago by fishy

The attached convert.php works for case 2, but not for case 1. I haven't found a way to handle both case 1 and case 2 yet :(

  Changed 5 years ago by DarTar

[reporting from the dev list, do you receive mail there?]

  • we should take care of users who are already using Unicode in Wikka (which gets stored as literal entities). The convert script should try to translate these strings into real Unicode, IMO.
  • the first places where I found the patch to break are the /raw handler and the search actions
  • you may want to take a look at our coding guidelines: all code that makes it into an official release should comply with these recommendations in terms of code layout, variable names etc.

  Changed 4 years ago by DarTar

  • cc KrzysztofTrybowski added

(adding Krzysztof to the ticket's CC list)

  Changed 4 years ago by DarTar

[from the dev list]

The attachment is a file to replace the convert.php from my last patches, so that the raw body comes out OK. It echoes the converted text, which is correct UTF-8, but when it writes the text back to the database with an UPDATE query, the encoding comes out wrong, as if I had stored UTF-8 text in a latin1 db. But I did convert the db to UTF-8 before the update, and I executed the "SET NAMES utf-8" query.

So, does anyone have a clue about this?
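One thing that may be worth checking (a guess, not something confirmed in this ticket): MySQL's name for the charset is utf8, without the hyphen, so a literal SET NAMES utf-8 fails with an "Unknown character set" error, and if the return value isn't checked the connection silently stays in its default charset. A hedged sketch ($link is illustrative):

// Use the MySQL charset identifier "utf8" (no hyphen) and check the result
// so a failed SET NAMES does not go unnoticed.
if (!mysql_query("SET NAMES 'utf8'", $link)) {
    die('SET NAMES failed: ' . mysql_error($link));
}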

Changed 4 years ago by DarTar

new version of convert script by fishy

  Changed 4 years ago by BrianKoontz

  • priority changed from normal to highest

  Changed 4 years ago by NilsLindenberg

  • severity changed from normal to blocker

Changed 4 years ago by BrianKoontz

Updated version of newconvert.php, with various fixes

  Changed 4 years ago by BrianKoontz

I've updated newconvert.php (see newconvert2.php). PHP is incredibly inconsistent in how it handles encoding translations. After many different attempts, I discovered that iconv() butchers any characters that are not in the input charset (in this case, ISO-8859-1), which causes literal characters to be trashed. So I've replaced that call with html_entity_decode(), which correctly ignores characters that can't be converted. I've tested this against the WikkaMultilanguageTestPage (I also copied the literal characters to ensure they weren't munged in the conversion process).
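A hedged sketch of the substitution described above (the variable name is illustrative, not the actual newconvert2.php code):

// html_entity_decode() turns stored entities such as &eacute; into UTF-8
// bytes and leaves characters it cannot map untouched, whereas
// iconv('ISO-8859-1', 'UTF-8', $body) would mangle any bytes that were
// never ISO-8859-1 to begin with.
$body = html_entity_decode($body, ENT_QUOTES, 'UTF-8');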

Changed 4 years ago by fishy

patch for create table statements to add utf8 charset

Changed 4 years ago by fishy

newconvert filled out with all the fields that need to be converted

  Changed 4 years ago by BrianKoontz

(In [1628]) Default charset changed to UTF-8. Refs #431.

follow-up: ↓ 14   Changed 4 years ago by BrianKoontz

(In [1630]) Removed CamelCase requirements for page names. Refs #431.

  Changed 4 years ago by BrianKoontz

  • description modified

in reply to: ↑ 12   Changed 4 years ago by BrianKoontz

Replying to BrianKoontz:

(In [1630]) Removed CamelCase requirements for page names. Refs #431.

This more properly belongs to #191.

follow-up: ↓ 16   Changed 4 years ago by BrianKoontz

Comments from #wikka:

(23:15:29) < manchot> brianko, the display of Chinese is fine, but searching with Chinese does not yield any results.

(23:19:48) < manchot> can't search in Chinese. There are no spaces between Chinese characters; it is just a string of unspaced words.

(23:23:25) < manchot> when I comment on a page that comes with the wikka install,

in reply to: ↑ 15   Changed 4 years ago by fishy

Replying to BrianKoontz:

Comments from #wikka: (23:15:29) < manchot> brianko, the display of Chinese is fine, but searching with Chinese does not yield any results. (23:19:48) < manchot> can't search in Chinese. There are no spaces between Chinese characters; it is just a string of unspaced words. (23:23:25) < manchot> when I comment on a page that comes with the wikka install, I get this error message when clicking on "display this comment": This page name is invalid. Valid page names must not contain the characters | ? = < > / ' " or &.

For searching: yes, we would need to drop the leading and trailing spaces around search terms, but this will also hurt English searching.

For the error message: maybe it is related to CamelCase? I can't reproduce it.

  Changed 4 years ago by BrianKoontz

fishy, don't worry about the second error...that's probably due to some ongoing work I'm doing. Do you think you have time to look into the search issue?

  Changed 4 years ago by fishy

I've tried and googled; it seems that MySQL's MATCH ... AGAINST full-text search simply doesn't support Chinese word segmentation, so the short answer is that we can't do it that way. The choices are:

1. Use a "LIKE '%$word%'" query instead of MATCH ... AGAINST. This causes problems such as returning "googleplex" for the search phrase "google", since we can't add leading and trailing spaces if Chinese search is to work.

2. Drop Chinese search support and ask users to use a Google "phrase site:yourwiki.com" query instead, until MySQL adds Chinese word-segmentation support.

I prefer 2; 1 doesn't really solve the problem.

  Changed 4 years ago by fishy

Below is the code WordPress uses (they are licensed under GPL, same as us):

$n = !empty($q['exact']) ? '' : '%';
$searchand = '';
foreach( (array) $q['search_terms'] as $term ) {
        $term = addslashes_gpc($term);
        $search .= "{$searchand}(($wpdb->posts.post_title LIKE '{$n}{$term}{$n}') OR ($wpdb->posts.post_content LIKE '{$n}{$term}{$n}'))";
        $searchand = ' AND ';
}

They use the "LIKE" approach: it works for Chinese, but it will match "googleplex" for the search phrase "google", and it doesn't support the "+" and "-" operators.

  Changed 4 years ago by fishy

This is the code that worked in Wikka:

function FullTextSearch($phrase, $caseSensitive = 0)
{
        // Should work with any browser/entity conversion scheme
        $search_phrase = mysql_real_escape_string($phrase);
        $sql = 'SELECT * FROM '.$this->config['table_prefix']."pages WHERE latest = 'Y'";
        // Note: the (array) cast wraps the whole phrase in a one-element array,
        // so multi-word phrases are not split into separate terms (see the follow-up below).
        foreach( (array) $search_phrase as $term )
                $sql .= " AND ((`tag` LIKE '%{$term}%') OR (body LIKE '%{$term}%'))";

        $data = $this->LoadAll($sql);
        return $data;
}

I'm not sure why the old code had this line (what does id have to do with case sensitivity?):

if ( 1 == $caseSensitive ) $id = ', id';

As we use the db collation utf8_general_ci, where "ci" means case-insensitive, I don't think we can do case-sensitive searches anyway, so I just ignored it.

And yes, we'll lose features like "+apple -macintosh", "apple*" and quoted phrases like "some words".

  Changed 4 years ago by BrianKoontz

Note to self: Replace foreach line above with:

foreach( explode(' ', $search_phrase) as $term )
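For reference, a hedged sketch of the method above with that substitution applied; this is illustrative only, not the code that was eventually committed:

// Hedged sketch: FullTextSearch with the explode() fix, so multi-word
// phrases become separate LIKE terms; a single unspaced Chinese string
// stays as one term.
function FullTextSearch($phrase, $caseSensitive = 0)
{
        $search_phrase = mysql_real_escape_string($phrase);
        $sql = 'SELECT * FROM '.$this->config['table_prefix']."pages WHERE latest = 'Y'";
        foreach( explode(' ', $search_phrase) as $term )
        {
                if ($term === '') continue; // skip empty chunks from repeated spaces
                $sql .= " AND ((`tag` LIKE '%{$term}%') OR (body LIKE '%{$term}%'))";
        }
        return $this->LoadAll($sql);
}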

  Changed 4 years ago by DarTar

The issue with no search results does not seem to be specific to Chinese; it affects any non-Latin string.

  Changed 4 years ago by BrianKoontz

(In [1713]) Installer now converts new and existing db and table/field types to utf8_unicode_ci. Note this update does *not* convert data entities to UTF-8! This will have to be done by the sysadmin using the script provided or by some other means external to Wikka. Refs #431

  Changed 4 years ago by BrianKoontz

(In [1714]) Implemented UTF-8 compatible searching, configurable with the utf8_compat_search param in wikka.config.php. Lang files other than 'en' need to be modified. Refs #431

  Changed 4 years ago by BrianKoontz

  • status changed from assigned to testing

  Changed 4 years ago by BrianKoontz

(In [1715]) Enabled SET NAMES for proper MySQL UTF-8 functionality; added script to convert HTML entities in the database to UTF-8. Refs #431

  Changed 4 years ago by BrianKoontz

  • description modified

  Changed 4 years ago by DarTar

  Changed 4 years ago by KrzysztofTrybowski

The script doesn't convert “double-unicoded” strings, which result from using pre-1.3 Wikka with the template modified to declare UTF-8 as the HTML charset.

Such a fix can be done this way:

$body = mb_convert_encoding($body, 'Windows-1252', 'UTF-8');

...but it would destroy the data if run on a proper UTF-8 database. Perhaps we should provide a script for that, but it should first run some checks, issue warnings, etc.
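A hedged sketch of one possible safety check (a heuristic only; the function name is made up for illustration): text that is still valid UTF-8 after one level of decoding is a strong hint that it was encoded twice.

// Heuristic sketch, not a reliable detector: undo one level of UTF-8
// encoding and see whether the result is still valid UTF-8. Pure ASCII
// converts to itself and is excluded by the !== check.
function looks_double_encoded($text)
{
        $once_decoded = mb_convert_encoding($text, 'Windows-1252', 'UTF-8');
        return $once_decoded !== $text && mb_check_encoding($once_decoded, 'UTF-8');
}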

Also I dislike the fact that the script creates new revisions of pages. Charset conversion is a purely technical task, and thus shouldn't leave any trace in the data.

Changed 4 years ago by KrzysztofTrybowski

Script converts pages, notes and comments that were encoded as UTF-8 twice into correct UTF-8, fixing the double-unicoded issue. ATTN: if run on a correct UTF-8 database, it will destroy it!

  Changed 3 years ago by DarTar

bump - will this be bundled with the 1.3 release?

follow-ups: ↓ 32 ↓ 33   Changed 3 years ago by BrianKoontz

Should be working.

in reply to: ↑ 31   Changed 3 years ago by fishy

Replying to BrianKoontz:

Should be working.

If it really works, then I think we can bundle it with the 1.3 release, but I think we need more testing. I'm extremely busy these days; maybe I'll get some time during the new year holiday (Jan. 1~3) to do it.

in reply to: ↑ 31 ; follow-up: ↓ 34   Changed 3 years ago by fishy

Replying to BrianKoontz:

Should be working.

Here are some testing results:

1. Original version upgrade: install a fresh 1.2p1 Wikka, add some Chinese and some accented text, copy the 1.3 files over (from the svn 1.3 branch), and (auto-)run the upgrade script.

Result: good: the 1.3 version is in UTF-8 and both Chinese and accented characters are displayed correctly. Bad: the Chinese characters, as stored in the database and shown while editing, are in the form &#NUM; (e.g. &#20013;&#25991;); it seems the preg_replace_callback part of newconvert3.php is missing.

2. UTF-8 db version upgrade: same as case 1, but create the database with collation utf8_general_ci, so 1.2 creates its tables with collation utf8_general_ci.

Result: same as case 1.

3. Modified-template upgrade: install a fresh 1.2p1 Wikka, modify the header of the template to change the charset from "iso-8859-1" to "utf-8", add some Chinese and some accented text; the rest is the same as above.

Result: FAILED. Both Chinese and accented characters are destroyed.

We don't have to handle case 3 automatically. But since this is an easy fix for pre-1.3, I think many CJK Wikka admins will have done it. We should give a serious warning in the upgrade asking users to back up their db before upgrading, and we could provide an additional script for this situation, or a wiki page with enough technical information for them to fix it themselves.

Btw, for cases 1 & 2, CJK users will have seen &#NUM; entities while editing pages containing CJK characters in pre-1.3, which makes it nearly unusable (I can't read the existing text), so I doubt there are any CJK users in cases 1 & 2.

Another thing: after upgrading, revision notes are missing (whether they are CJK, English or accented), which seems to be a bug.

in reply to: ↑ 33   Changed 3 years ago by KrzysztofTrybowski

Replying to fishy:

Replying to BrianKoontz:

Should be working.

Here are some testing results: 1. Original version upgrade: install a fresh 1.2p1 Wikka, add some Chinese and some accented text, copy the 1.3 files over (from the svn 1.3 branch), and (auto-)run the upgrade script. Result: good: the 1.3 version is in UTF-8 and both Chinese and accented characters are displayed correctly. Bad: the Chinese characters, as stored in the database and shown while editing, are in the form &#NUM; (e.g. &#20013;&#25991;); it seems the preg_replace_callback part of newconvert3.php is missing.

The script newconvert3.php is now called convertHTMLEntities.php. It's not run automatically, so it's quite obvious that during editing you'll see HTML entities. I think the user should have the choice to run the script during the update.

If you run the script, it works. The flaw — IMHO — is that it creates a new revision of an article instead of just converting everything in place. But that's a design decision.

If we don't want to run the script automatically, we could at least use the “preg_replace_callback part” during editing (see the sketch below), so that a page not yet converted from HTML entities is converted just before editing. One side effect: if a user intentionally types an HTML entity, it will be converted on the next edit. But that's not an issue: HTML entities are a workaround for when the text is not Unicode; if we have Unicode, there is no point in using HTML entities.
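A hedged sketch of that on-the-fly conversion, assuming mbstring is available; the callback name is made up for illustration:

// Convert numeric entities such as &#20013; back into real UTF-8 characters
// before the page body is loaded into the edit form.
function entity_to_utf8($matches)
{
        return mb_convert_encoding($matches[0], 'UTF-8', 'HTML-ENTITIES');
}
$body = preg_replace_callback('/&#\d+;/', 'entity_to_utf8', $body);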

3. Modified-template upgrade: install a fresh 1.2p1 Wikka, modify the header of the template to change the charset from "iso-8859-1" to "utf-8", add some Chinese and some accented text; the rest is the same as above. Result: FAILED. Both Chinese and accented characters are destroyed.

Yes, this is to be expected. It can be converted easily, but the conversion has the unfortunate trait that if it is run twice on the same text, or run on proper UTF-8, it will destroy it. Reliable automatic detection of such cases is not possible.

We don't have to handle case 3 automatically.

Automatically, we can't. But there should be a script for this, one that presents the user with some warnings, perhaps some samples of the text before and after conversion, and lets the user decide what to do. Perhaps backups of the changed material should be made as well.

Note that this situation affects not only CJK users but all non-ISO-8859-1 users: for example Polish and other Central European languages, but also Cyrillic scripts such as Ukrainian and Russian.

Btw, for cases 1 & 2, CJK users will have seen &#NUM; entities while editing pages containing CJK characters in pre-1.3, which makes it nearly unusable (I can't read the existing text), so I doubt there are any CJK users in cases 1 & 2.

Cyrillic is also unusable, and Polish, for example, is hardly usable. On my site I used a “dirty UTF-8 hack”, so I'm in case 3.

Another thing: after upgrading, revision notes are missing (whether they are CJK, English or accented), which seems to be a bug.

Yes, there is a bug in line 470 of the installer: the “note” field has its type changed to ENUM('Y','N'). It's probably a copy-paste error from the line above it, which deals with the “latest” field, which indeed is ENUM('Y','N').
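A hedged sketch of what the corrected installer statement might look like; the column definitions and config key here are guesses, see #1021 for the actual fix:

// Illustrative only: keep the ENUM type on "latest" but leave "note" as a
// plain VARCHAR. Real lengths/defaults may differ from this sketch.
mysql_query("ALTER TABLE ".$config['table_prefix']."pages
             MODIFY latest ENUM('Y','N') NOT NULL DEFAULT 'N',
             MODIFY note VARCHAR(50) NOT NULL DEFAULT ''");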

  Changed 3 years ago by KrzysztofTrybowski

Regarding installer bug mentioned above, see #1021.

  Changed 3 years ago by KrzysztofTrybowski

I'm attaching a new version of the script to convert “double-unicoded” strings, which result from using pre-1.3 Wikka with the template modified to declare UTF-8 as the HTML charset.

Version 0.2 can only be run from the command line, but this will change later. The script is lengthy because it performs some checks in order not to break anything (or to break only a little ;)).

It is tested on all the Wikka databases I have, which is not many. Please test some more, or share databases with me so that I can test.

Note 1: The script adds a configuration variable to wikka.config.php, dbl_utf8_conversion_run. This is needed to prevent accidental multiple runs over the same data.

Note 2: The script uses the seconds field of pages' and comments' timestamps to mark which ones have already been processed: pages/comments with seconds = 01 are in the queue, those with seconds = 02 are already processed. This means the original seconds information in the timestamps is lost. I consider this a minor issue, but if anybody has a better idea for marking pages/comments as already processed, feel free to share :)

More information, a manual and a @todo list are in the attached file.

Changed 3 years ago by KrzysztofTrybowski

  Changed 3 years ago by BrianKoontz

  • status changed from testing to assigned

  Changed 3 years ago by BrianKoontz

  • status changed from assigned to testing
  • milestone changed from 1.3 to 1.4

  Changed 3 years ago by BrianKoontz

This link explains the complexity of converting a non-UTF-8 database to UTF-8:

http://codex.wordpress.org/Converting_Database_Character_Sets
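The core difficulty it describes is data that is really UTF-8 but sits in columns declared as latin1: a plain CONVERT TO CHARACTER SET would transcode (and mangle) the bytes. A commonly recommended workaround, sketched below with an illustrative table, column and types, is to pass through a binary type so MySQL does not reinterpret the data:

// Hedged sketch of the binary round-trip for a single column; a real
// conversion must cover every text column and preserve the original
// column types and lengths.
mysql_query("ALTER TABLE wikka_pages MODIFY body MEDIUMBLOB");
mysql_query("ALTER TABLE wikka_pages MODIFY body MEDIUMTEXT CHARACTER SET utf8 COLLATE utf8_unicode_ci");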

  Changed 13 months ago by DotMG

  • description modified