Recent surveys of web pages and web usage by Global Reach and FUNDREDES show that English-language content is now down to 40% of total web content; the remaining 60% is in other languages. Similarly, web users are now mostly non-native English speakers whose browsers default to the character set of another language. Extrapolating these figures suggests that the rise of non-English languages on the web will continue, particularly for Far Eastern languages.
A consequence of this is that search engines and other web agents are now becoming smarter about the language in which pages are written and how to present them to users.
The document character set for XML and HTML 4.0 is Unicode (aka ISO 10646). This means that HTML browsers and XML processors should behave as if they used Unicode internally. But it doesn't mean that documents have to be transmitted in Unicode. As long as client and server agree on the encoding, they can use any encoding that can be converted to Unicode.
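As a rough illustration of that last point, the sketch below sends the same text in three different encodings and converts each back to Unicode on receipt. The sample text and the `to_unicode` helper are assumptions made for this example; Python's built-in codecs stand in for whatever conversion tables a real browser or XML processor would use.

```python
# The same text, transmitted in three different encodings.
text = "日本語 and English"

wire_forms = {
    "euc-jp": text.encode("euc-jp"),
    "iso-2022-jp": text.encode("iso-2022-jp"),
    "utf-8": text.encode("utf-8"),
}


def to_unicode(payload: bytes, charset: str) -> str:
    """Decode transmitted bytes using the encoding both sides agreed on."""
    return payload.decode(charset)


# Each byte sequence is different on the wire, but all of them map back
# to the same Unicode string once the right encoding is applied.
decoded = {name: to_unicode(data, name) for name, data in wire_forms.items()}

for name, data in wire_forms.items():
    print(name, len(data), "bytes ->", decoded[name])
```

The bytes differ from one encoding to the next, yet the receiving side ends up with identical Unicode text, provided it knows which encoding was used.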
It is very important that the character encoding of any XML or (X)HTML document is clearly labeled. This can be done in the following ways:
Content-Type: text/html; charset=EUC-JP
<?xml version="1.0" encoding="iso-8859-1" ?>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
With this information, clients can easily map these encodings to Unicode.
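A tool that consumes pages has to read these labels before it can do the mapping. The following is a minimal sketch of how the three forms shown above might be read using only the Python standard library. The function names are hypothetical, and the regular expressions are deliberately simplistic; a production tool would use a real HTML or XML parser rather than pattern matching.

```python
import re
from email.message import Message

# Hypothetical, simplified patterns for the two in-document labels.
XML_DECL = re.compile(
    r'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)
META_CT = re.compile(
    r'<meta[^>]*?http-equiv\s*=\s*["\']Content-Type["\']'
    r'[^>]*?content\s*=\s*["\']([^"\']*)["\']',
    re.IGNORECASE,
)


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_param("charset")


def charset_from_xml_declaration(head):
    """Return the encoding named in an XML declaration, or None."""
    match = XML_DECL.search(head)
    return match.group(1) if match else None


def charset_from_meta(head):
    """Return the charset named in a meta http-equiv element, or None."""
    match = META_CT.search(head)
    if match is None:
        return None
    # Collapse line breaks inside the attribute value before parsing it.
    return charset_from_content_type(" ".join(match.group(1).split()))


print(charset_from_content_type("text/html; charset=EUC-JP"))
print(charset_from_xml_declaration('<?xml version="1.0" encoding="iso-8859-1" ?>'))
print(charset_from_meta(
    '<meta http-equiv="Content-Type" content="text/html;\ncharset=utf-8">'
))
```

Each helper returns `None` when no label is present, which leaves the caller to decide on a fallback.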
In practice, a few encodings will be preferred, most likely: ISO-8859-1 (Latin-1), US-ASCII, UTF-8, UTF-16, also the other encodings in the ISO-8859 series, iso-2022-jp, euc-kr, and so on.
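Encoding labels are case-insensitive and often have aliases, so a tool usually needs to check whether it recognises a label at all and whether two labels name the same encoding. The sketch below uses Python's `codecs.lookup` for this. The helper names and the made-up label `x-made-up-charset` are assumptions for the example, and Python's codec registry is only one possible list of known encodings.

```python
import codecs

# The encodings named in the text above.
PREFERRED = ["ISO-8859-1", "US-ASCII", "UTF-8", "UTF-16", "iso-2022-jp", "euc-kr"]


def is_known_encoding(label: str) -> bool:
    """True if this Python installation has a codec for the label."""
    try:
        codecs.lookup(label)
        return True
    except LookupError:
        return False


def same_encoding(a: str, b: str) -> bool:
    """True if two labels resolve to the same codec."""
    return codecs.lookup(a).name == codecs.lookup(b).name


for label in PREFERRED:
    print(label, is_known_encoding(label))

print(same_encoding("Latin-1", "ISO-8859-1"))
print(is_known_encoding("x-made-up-charset"))
```

A label that the registry does not recognise cannot be mapped to Unicode at all, which is the situation a reader's software is in when it meets a character set it has never heard of.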
If you are producing web pages in English, you must still declare the character set. If you do not, readers whose web browsers default to non-English character encodings will see your web pages as a jumble of incomprehensible strokes. Such readers are becoming the majority of web users, and without a character set declaration you are not presenting your material to them. If you do make the declaration, their browsers will make the mapping and present the English text as you intended.
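The effect can be reproduced in a few lines. In this sketch an English page containing a pound sign is sent as ISO-8859-1 with no declaration, and the reader's software is assumed to fall back to ISO-8859-5 (Cyrillic). The sample text and the choice of fallback encoding are illustrative assumptions only. Note that in this particular pairing the plain ASCII letters survive and it is the characters beyond ASCII that are corrupted.

```python
# An English page that uses one character outside US-ASCII.
page_text = "Price: £10"
wire_bytes = page_text.encode("iso-8859-1")

# No declaration: the reader's browser guesses its own default encoding
# (assumed here to be ISO-8859-5).
guessed = wire_bytes.decode("iso-8859-5")

# With a declaration, the browser decodes with the encoding the author used.
declared = wire_bytes.decode("iso-8859-1")

print("guessed: ", guessed)
print("declared:", declared)
```

With the declaration in place the text arrives as written; without it, the result depends entirely on the reader's local default.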
Several UK W3C members produce web sites using non-English material (e.g. Arabic, Chinese, Japanese). A brief survey shows that most of these either use no character set declaration or use proprietary charsets such as those provided by Microsoft. If you do not use a character set declaration, the chances are that your intended audience will not be able to read the web page. If you use a proprietary character set, your web pages will be readable only by people who have the proprietary operating system or browser that provides it. Everyone else's tools will have no way to map from a character set they do not know about to Unicode, so you are vastly limiting your audience and market.
If you are a tool producer you should ensure that your tools are capable of handling character set declarations correctly.
The UK office of W3C is currently producing a primer on Web Site Internationalisation, which will be announced in this newsletter when it is available. In the meantime, considerable guidance is available on the W3C internationalisation web pages.
The World Wide Web Consortium (W3C) will be holding a series of one day events around Europe this spring to promote W3C technology Recommendations and show how they facilitate interoperability on the World Wide Web.
The W3C Interop Tour of Europe 2002 will be holding the following events:
The event on 30th May in Dublin is arranged by the UK Office of W3C, and it is hoped that as many readers from the UK and Ireland as possible can attend.
The EuroWeb 2002 Conference will be held at St Anne's College, Oxford, UK on the 17th and 18th December 2002. EuroWeb 2002 will be a major international forum at which research on the World Wide Web, GRIDs and Web Services is presented. It follows on from the success of EuroWeb 2001, which was held in Pisa in December 2001 on the topic of the web in public administration.
8 April 2002: Jigsaw version 2.2.1 is available for download. The new version includes a security fix for URI parsing, a new JigShell utility, XHTML/HTML validation on PUT, JigEdit support for WebDAV, Apache mod_asis, and PushCache contributed by Paul Henshaw. The release notes list all new features and bug fixes. Jigsaw is W3C's leading-edge Web server platform implemented in Java. Learn more about the Jigsaw Activity.
16 April 2002: The World Wide Web Consortium today released "The Platform for Privacy Preferences 1.0 (P3P 1.0)" as a W3C Recommendation. The specification has been reviewed by the W3C Membership, who favor its adoption by industry. P3P allows people to define and publish their Web site privacy policies, and helps automate how those policies are read. P3P also gives users control over the use of their personal information on Web sites they visit, thus promoting trust and confidence in the Web. Read the press release and testimonials.
The RDF Core Working Group has released the first public Working Draft of the RDF Primer. The Resource Description Framework (RDF) is a general-purpose language for representing information in the Web. This primer provides the fundamentals required to use RDF in applications. Read about the Semantic Web Activity.
W3C's Semantic Web Advanced Development initiative announces the release of IsaViz, a visual environment for browsing and authoring RDF models represented as graphs. IsaViz has a 2.5D user interface allowing smooth zooming and navigation. IsaViz supports RDF/XML and N-Triple import and export, and SVG and PNG export. Developed by Emmanuel Pietriga of W3C and Xerox Research Centre Europe, IsaViz is based on the Xerox Visual Transformation Machine, Hewlett-Packard's Jena, Graphviz from AT&T Research, and Apache's Xerces. Learn more about IsaViz.
Browse past W3C Team talks and presentations and upcoming W3C appearances and events.
Please welcome: