Character encoding and language for Web sites




Developing HTML pages requires a working knowledge of character encoding and language declarations, for example to validate user input in the front end and to keep pages compatible across browsers.

Character encoding

Definition

Encoding in general refers to the process of representing information in some form.
Any character encoding involves at least two components: a set of characters and a system for representing them.
ASCII is one of the oldest standards for doing this that is still in use.
In a character encoding, each character of a character set is assigned a numeric code, which is then mapped to one or more octets (bytes).

History

ASCII
In the beginning there was ASCII, one of the first character encoding schemes to be widely adopted. Its 128 characters fit into 7 bits.

ISO
As processors work in powers of 2, it made sense to add an 8th bit to the standard, which raised the number of combinations to 256. At the time, opinions differed widely on what to do with the 8th bit and the new 128-255 range. As usual, everybody did what they wanted, and the ISO schemes were created to improve the situation. The ISO schemes are numbered up to 16 (coming in the format ISO-…-1 ).
Every ISO scheme puts different characters in the new positions.

Unicode
As modern computers advanced to 32-bit and 64-bit architectures, demand grew for a single character set that would combine all of the ISO schemes. The result was Unicode, which gives each of its more than 110,000 characters a number that fits into 21 bits.

UTF-8
A new problem had arisen. Most digital data was transported 8 bits at a time, so sending every character as a full 21-bit Unicode value was wasteful and costly.
The solution was a new multi-byte encoding scheme called UTF-8, in which the value of each byte tells whether it stands alone, starts a sequence, or continues one.
Byte values 0-127 are the ASCII characters, which keeps UTF-8 backward compatible with ASCII. Byte values 128-191 are continuation bytes (the key to be shifted) and 192-247 are lead bytes (the shift key).
UTF-8 sends a sequence of 1-4 bytes for each character. The lead byte announces how many continuation bytes follow: a 2-byte sequence adds a single shift, a 3-byte sequence a double shift and a 4-byte sequence a triple shift.
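The byte ranges described above can be observed directly. The following is a minimal sketch, assuming an environment that provides the standard TextEncoder (current browsers and Node.js); the helper names utf8Bytes and classifyByte are made up for this illustration.

```javascript
// Encode a string as UTF-8 and return the byte values as plain numbers.
function utf8Bytes(text) {
  return Array.from(new TextEncoder().encode(text));
}

// Classify one byte value by the range it falls in.
function classifyByte(byte) {
  if (byte <= 127) return "ascii";         // 0xxxxxxx: a complete ASCII character
  if (byte <= 191) return "continuation";  // 10xxxxxx: continues a sequence
  if (byte <= 247) return "lead";          // 110xxxxx, 1110xxxx, 11110xxx: starts a sequence
  return "unused";                         // 248-255 never appear in UTF-8
}

// "\u00E8" is the letter e with grave accent, "\u20AC" is the euro sign.
// The escapes keep this source file pure ASCII, whatever encoding it is saved in.
for (const ch of ["A", "\u00E8", "\u20AC"]) {
  const bytes = utf8Bytes(ch);
  console.log(bytes.length + " byte(s): " + bytes.map(classifyByte).join(", "));
}
// prints:
// 1 byte(s): ascii
// 2 byte(s): lead, continuation
// 3 byte(s): lead, continuation, continuation
```

The ASCII letter stays a single byte, which is the backward compatibility mentioned above; every other character starts with one lead byte followed by one to three continuation bytes.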

Declaring and checking the character encoding

The data

Save your data (text) in the appropriate encoding from your editing environment.

Declare

Example: declaring the character encoding in an XHTML page

<head>
<!-- this must be the first meta tag -->
<meta http-equiv="content-type" content="text/html;charset=utf-8" />
..
..
</head>

Check

Make sure that the encoding you declare in the document does not conflict with the one the server applies automatically: the encoding sent by the server in the HTTP header overrides declarations in HTML pages.
Example for Firefox:
go to Tools – Page Info and look for Encoding. If it shows the same encoding as declared in the HTML, the document is OK.
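The same check can be scripted. The sketch below is only an illustration: it assumes you already have the value of the server's Content-Type header as a string (for example from your browser's developer tools), and the function names charsetFromContentType and encodingsAgree are hypothetical.

```javascript
// Extract the charset parameter from a Content-Type header value,
// or return null when the server did not send one.
function charsetFromContentType(headerValue) {
  const match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(headerValue || "");
  return match ? match[1].toLowerCase() : null;
}

// Compare the server's charset with the one declared in the page.
// No charset from the server means there is nothing to conflict with.
function encodingsAgree(headerValue, declaredInPage) {
  const fromServer = charsetFromContentType(headerValue);
  return fromServer === null || fromServer === declaredInPage.toLowerCase();
}

console.log(encodingsAgree("text/html; charset=UTF-8", "utf-8"));     // true
console.log(encodingsAgree("text/html;charset=ISO-8859-1", "utf-8")); // false
console.log(encodingsAgree("text/html", "utf-8"));                    // true
```

In a browser, document.characterSet reports the encoding that was actually used for the current page, so it can serve as the second argument's counterpart when checking a live document.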

Keyboard events (from a JavaScript perspective)

Character codes and key codes are delivered by the onkeypress and the onkeydown (onkeyup) events. These codes let you decide whether a key or character is allowed in your program.

When the user presses a key, the onkeydown or onkeyup event delivers the key code (or scan code).

When the user presses a key that produces a character, the onkeypress event delivers the character code. For the characters 0-127 this is the ASCII value; in general, browsers report the Unicode value of the character.

Frontend development can be confusing because different browsers implement keyboard events inconsistently. The information here mainly concerns older browsers; current browsers also provide event.key, and keyCode and charCode are deprecated.

Key codes

Definition

Every key on the keyboard has an associated numeric code (8 bit), also known as the scan code. The scan code is implemented in hardware/firmware. Some keyboard standards send one scan code when a key is pressed and a different one when it is released. The key code identifies the key itself, not the character it produces: the A key gives the same key code with or without Shift.

Get the key code with JavaScript

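<p>
<input type="text" onkeydown="ValidateKeyCode(event);" />
</p>
<p id="valfam" style="color:red"></p>

<script>
// Show the key code of the key that was just pressed.
function ValidateKeyCode(event)
{
 // keyCode identifies the key itself, not the character it produces
 var keyCode = event.keyCode;

 document.getElementById("valfam").innerHTML = "the key code is > " + keyCode;
}
</script>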

Character codes

Definition

Every character has an associated numeric code value. The value depends on the character encoding scheme used.
Characters are either printable or non-printable. Printable characters are displayed on the screen, like A b c; the SPACE character is a printable one too! Keys such as Enter, Backspace, Delete and Tab correspond to non-printable (control) characters, while Ctrl, Alt and Shift produce no character at all.
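To make the distinction concrete, here is a small sketch that looks up character codes and separates printable characters from control characters. It assumes Unicode values, where the control characters occupy the ranges 0-31 and 127-159; the helper names codeOf and isPrintableCode are invented for this example.

```javascript
// Return the numeric (Unicode) code of the first character in a string.
function codeOf(ch) {
  return ch.codePointAt(0);
}

// Control characters occupy 0-31 and 127-159; everything else is treated
// as printable here, including SPACE (32).
function isPrintableCode(code) {
  const isControl = code < 32 || (code >= 127 && code <= 159);
  return !isControl;
}

console.log(codeOf("A"), isPrintableCode(codeOf("A")));   // 65 true
console.log(codeOf(" "), isPrintableCode(codeOf(" ")));   // 32 true
console.log(codeOf("\t"), isPrintableCode(codeOf("\t"))); // 9 false
```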

Get the character code with JavaScript

<p>
<input type="text" onkeypress="ValidateCharCode(event);" />
</p>
<p id="valfam2" style="color:blue"></p>

<script>
// Show the character code of the character that was just typed.
function ValidateCharCode(event)
{
 // charCode is only filled in for keypress events
 var charCode = event.charCode;

 document.getElementById("valfam2").innerHTML = "the character code is > " + charCode;
}
</script>
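These codes can then drive the actual input validation. The following is a hedged sketch of a digits-only check, not a complete solution: the function name isAllowedDigitEvent and the list of editing keys are assumptions made for the example. It prefers event.key, which current browsers provide, and falls back to charCode for older ones.

```javascript
// Hypothetical helper: decide whether a keyboard event should be accepted
// in a field that only takes the digits 0-9.
function isAllowedDigitEvent(event) {
  // Editing and navigation keys that should keep working (assumed list).
  const editingKeys = ["Backspace", "Delete", "Tab", "Enter", "ArrowLeft", "ArrowRight"];

  if (typeof event.key === "string") {
    if (editingKeys.includes(event.key)) return true;
    return /^[0-9]$/.test(event.key);
  }

  // Fallback for older browsers: character codes 48-57 are the digits 0-9.
  // A charCode of 0 means the key produced no character, so nothing is rejected.
  const code = event.charCode;
  return code === 0 || (code >= 48 && code <= 57);
}

console.log(isAllowedDigitEvent({ key: "5" }));         // true
console.log(isAllowedDigitEvent({ key: "a" }));         // false
console.log(isAllowedDigitEvent({ key: "Backspace" })); // true
console.log(isAllowedDigitEvent({ charCode: 53 }));     // true
```

In a page, such a function could be called from a keydown or keypress handler that calls event.preventDefault() when the result is false. Because keyboard shortcuts and pasted text bypass this kind of check, the final value of the field should still be validated.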

Language and encoding in HTML pages

Basic declarations

The default language should be identified in the html tag with:
HTML 4.0 : the "lang" attribute, example : lang="fr"
XHTML 1.0 : "lang" and "xml:lang" attributes
XHTML 1.1 : "xml:lang" attribute
HTML 5 : the "lang" attribute

XHTML 1.1 example

DOCTYPE tag (the language code in it is always EN! It has nothing to do with the page language!)

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

html tag (default language for the page)

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="fr">

meta tag (the encoding must cover the characters of the page language)

<meta http-equiv="content-type" content="text/html;charset=utf-8" />

HTML 5 example

<!DOCTYPE html>            <!-- doctype -->
<html lang="fr">           <!-- language -->
<meta charset="UTF-8" />   <!-- encoding -->

Language-specific markup

For one tag

<p lang="fr"> Text for a specific language</p>

For one character (example: the HTML character entity &egrave; will display the character è)

<p>&egrave;</p>
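A character can also be written as a numeric reference built from its character code, which ties this back to the character codes discussed earlier. The sketch below is only an illustration; the function name toNumericReference is made up for it.

```javascript
// Build the decimal numeric character reference for the first character
// of a string, using its Unicode code value.
function toNumericReference(ch) {
  return "&#" + ch.codePointAt(0) + ";";
}

// "\u00E8" is the letter e with grave accent, written as an escape so that
// this source file does not depend on the encoding it is saved in.
console.log(toNumericReference("\u00E8")); // &#232;
console.log(toNumericReference("A"));      // &#65;
```

A reference such as &#232; uses only ASCII characters in the source, so it is displayed correctly whatever encoding the page is saved in.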

Cross browser compatibility

Declare the character encoding:
for XHTML : <meta http-equiv="content-type" content="text/html;charset=utf-8" />
for HTML 5 : <meta charset="UTF-8" />

Check the server-side encoding with Firefox: go to Tools – Page Info and look for UTF-8

Check that the Web page contains the correct/desired language definition(s)

Check in different browsers whether the character codes are interpreted in the same way

This gives you a good basis for cross-browser compatible input validation and for displaying the right characters!

