Ticket #999 (assigned defect)

Opened 4 years ago

Last modified 2 years ago

makeId() and Unicode

Reported by: DarTar
Owned by: BrianKoontz
Priority: highest
Milestone: 1.4
Component: core
Version: 1.3
Severity: blocker
Keywords: utf8
Cc:

Description (last modified by BrianKoontz) (diff)

1.3 introduces automatic fragment linking of headings via a new core method written by JavaWoman, Wakka::makeId(). The trouble with this method is that it generates meaningless ids for non-ASCII extended Latin content (this concern does not apply to non-Latin scripts, e.g. Chinese or Arabic, since meaningless hashes are used as ids in that case anyway).

To give an example, the Polish heading "Użyteczne strony" produces the following id via makeId(): hn_Uyteczne_strony. The ż character is dropped, so the id is meaningless in Polish.

The method correctly applies the HTML 4.0 specification, which defines the following naming rules for ids:

  • Must begin with a letter (A-Z or a-z)
  • Can be followed by: letters (A-Za-z), digits (0-9), hyphens ("-"), underscores ("_"), colons (":"), and periods (".")
  • Values are case-sensitive

We should fix this in one of the following ways:

  • (1) use a conversion table to ASCIIfy extended Latin characters, so that every ż is converted to z
  • (2) keep extended Latin characters in the fragment id, if the XHTML specs allow this
  • (3) escape every non-ASCII character, as MediaWiki does
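
To make options (1) and (3) concrete, here is a rough sketch in Python (not the PHP that makeId() is written in). Everything in it is an assumption for illustration: the function names, the small EXTRA table, and the "hn_" prefix handling are hypothetical, and the escaping variant only loosely imitates dot-style escaping (percent-encoding with "%" swapped for "."), not MediaWiki's actual code.

```python
import re
import unicodedata
from urllib.parse import quote

# Hypothetical table for letters that have no Unicode decomposition,
# so stripping combining marks alone would not ASCIIfy them.
EXTRA = {"ł": "l", "Ł": "L", "ø": "o", "Ø": "O", "ß": "ss", "đ": "d", "Đ": "D"}


def asciify(text):
    """Option (1): map extended Latin letters to ASCII base letters."""
    text = unicodedata.normalize("NFC", text)
    text = "".join(EXTRA.get(ch, ch) for ch in text)
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks left over after decomposition (e.g. the dot of ż).
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def make_id_asciified(heading, prefix="hn_"):
    """Option (1): ASCIIfy, then keep only characters legal in an HTML 4 id."""
    text = re.sub(r"\s+", "_", asciify(heading).strip())
    return prefix + re.sub(r"[^A-Za-z0-9_:.-]", "", text)


def make_id_escaped(heading, prefix="hn_"):
    """Option (3): percent-encode the UTF-8 bytes, using '.' instead of '%'."""
    text = re.sub(r"\s+", "_", heading.strip())
    escaped = quote(text, safe="")
    # quote() leaves '~' untouched, but '~' is not legal in an HTML 4 id.
    return prefix + escaped.replace("%", ".").replace("~", ".7E")


print(make_id_asciified("Użyteczne strony"))  # hn_Uzyteczne_strony
print(make_id_escaped("Użyteczne strony"))    # hn_U.C5.BCyteczne_strony
```

Both results stay within the HTML 4 pattern. The difference is readability: option (1) loses the diacritic but stays legible, while option (3) keeps the information at the cost of ids like hn_U.C5.BCyteczne_strony.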

I am afraid that this will need to be addressed in 1.3 if automatic fragment linking is introduced in this release, as we won't be able to make any changes to the method once people start using these ids.

Related tickets

#970 Document 1.3 features. Not currently documented as it appears to be broken.

Change History

  Changed 4 years ago by KrzysztofTrybowski

This might be handy — specification for XML:

http://www.w3.org/TR/2008/REC-xml-20081126/#sec-common-syn
http://www.w3.org/TR/2008/REC-xml-20081126/#id

Unicode characters are generally allowed.

OTOH, the XHTML 1.0 documentation says in http://www.w3.org/TR/xhtml1/#h-4.10: “See the HTML Compatibility Guidelines for information on ensuring such anchors are backward compatible when serving XHTML documents as media type text/html.”

And then in http://www.w3.org/TR/xhtml1/#C_8: “Note that the collection of legal values in XML 1.0 Section 2.3, production 5 is much larger than that permitted to be used in the ID and NAME types defined in HTML 4. When defining fragment identifiers to be backward-compatible, only strings matching the pattern [A-Za-z][A-Za-z0-9:_.-]* should be used. See Section 6.2 of [HTML4] for more information.”
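
For reference, the backward-compatible pattern quoted above can be checked mechanically. This is a minimal Python sketch; the helper name is made up for illustration and is not part of Wikka.

```python
import re

# Pattern quoted from XHTML 1.0, Appendix C.8.
HTML4_ID = re.compile(r"[A-Za-z][A-Za-z0-9:_.-]*")


def is_backward_compatible_id(value):
    """Return True if the whole string matches the HTML 4 id pattern."""
    return HTML4_ID.fullmatch(value) is not None


print(is_backward_compatible_id("hn_Uyteczne_strony"))   # True
print(is_backward_compatible_id("hn_Użyteczne_strony"))  # False
print(is_backward_compatible_id("1st_heading"))          # False
```

The id that makeId() currently produces passes, while an id that keeps the ż does not, which is why option (2) conflicts with the compatibility guideline.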

  Changed 4 years ago by DarTar

  • description modified (diff)

OK, so that confirms we should stick to the original pattern. The question, then, is whether we should go for option (1) or (3). (Adding numbers to the options in the ticket description.)

follow-up: ↓ 4   Changed 4 years ago by BrianKoontz

Wow. What MW does is rather ugly (like many things MW does). So my question to those who use non-ASCII extended characters: Which method is least offensive?

in reply to: ↑ 3   Changed 4 years ago by DarTar

Replying to BrianKoontz:

So my question to those who use non-ASCII extended characters: Which method is least offensive?

In Italian, I would rather go for option (1), character replacement (e.g. à -> a). K confirmed the same would work pretty well in Polish. We may want to render umlaut characters differently, e.g. ä -> ae, but I am sure there are existing libraries we can use for this purpose.
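
One way the "ä -> ae" special case could sit on top of plain accent stripping is a per-language override table. The Python sketch below is only an illustration under that assumption: the GERMAN table and the transliterate() helper are hypothetical names, not an existing library API.

```python
import unicodedata

# Hypothetical per-language overrides; anything not listed falls back
# to simply dropping the diacritic.
GERMAN = {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}


def transliterate(text, overrides=None):
    """Replace each character via the override table, else strip its accents."""
    overrides = overrides or {}
    # Compose first so that "a" + combining diaeresis matches the "ä" key.
    text = unicodedata.normalize("NFC", text)
    out = []
    for ch in text:
        if ch in overrides:
            out.append(overrides[ch])
            continue
        decomposed = unicodedata.normalize("NFKD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


print(transliterate("Città"))         # Citta
print(transliterate("Käse", GERMAN))  # Kaese
print(transliterate("Käse"))          # Kase
```

The catch is that the right table depends on the language of the page, which the heading text alone does not reveal.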

  Changed 4 years ago by BrianKoontz

  • priority changed from normal to highest
  • severity changed from normal to blocker

  Changed 4 years ago by BrianKoontz

  • owner changed from unassigned to BrianKoontz
  • status changed from new to assigned

  Changed 4 years ago by KrzysztofTrybowski

It seems that ids in HTML5 will be allowed to contain any character except a space. Even if HTML5 isn't ready yet, it might be sensible to actually use this feature and test how it behaves in browsers. In any case, ASCIIfication should be optional.

http://www.electrictoolbox.com/html5-valid-characters-id-attribute/
http://mathiasbynens.be/notes/html5-id-class

You can test here:  http://mathiasbynens.be/demo/html5-id
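
As a sketch of how loose the HTML5 rule is compared with HTML 4, the check below assumes the rule is "at least one character, and no space characters" (space, tab, line feed, form feed, carriage return). The helper name is hypothetical.

```python
# Characters that HTML5 treats as "space characters" (assumed set).
HTML5_SPACE = " \t\n\f\r"


def is_valid_html5_id(value):
    """Return True if value is non-empty and contains no space characters."""
    return len(value) > 0 and not any(ch in HTML5_SPACE for ch in value)


print(is_valid_html5_id("hn_Użyteczne_strony"))  # True
print(is_valid_html5_id("Użyteczne strony"))     # False
print(is_valid_html5_id(""))                     # False
```

Under this rule, keeping ż in the id would be valid as long as whitespace is replaced, for example with underscores.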

  Changed 3 years ago by BrianKoontz

  • milestone changed from 1.3 to 1.4

  Changed 3 years ago by BrianKoontz

  • keywords utf8 added

  Changed 2 years ago by BrianKoontz

  • description modified (diff)