Thursday, August 25, 2005

The UTF-8 encoding rules

1. Characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8. A single byte is needed for any of these characters!
2. All characters >U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.
3. The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization andmakes the encoding stateless and robust against missing bytes
4. All possible 231 UCS codes can be encoded
5. UTF-8 encoded characters may theoretically be up to six bytes long, however 16-bit characters (the ones implicitly supported in the UCS-2 encoding, and by Str Library) are only up to three bytes long
6. The sorting order of Bigendian UCS-4 byte strings is preserved
7. The bytes 0xFE and 0xFF are never used in this encoding

The following byte sequences are used to represent a character:

U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The xxx bit positions are filled with the bits of the character code number in binary representation. The rightmost x bit is the least-significant bit. Only the shortest possible multibyte sequence which can represent the code number of the character can be used. Note that in multibyte sequences, the number of leading 1 bits in the first byte is identical to the number of bytes in the entire sequence.

Wednesday, August 24, 2005

Form Submission

Form content types
The enctype attribute of the FORM element specifies the content type used to encode the form data set for submission to the server. User agents must support the content types listed below. Behavior for other content types is unspecified.
Please also consult the section on
escaping ampersands in URI attribute values.
application/x-www-form-urlencoded
This is the default content type. Forms submitted with this content type must be encoded as follows:
Control names and values are escaped. Space characters are replaced by
`+', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by `%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., `%0D%0A').
The control names/values are listed in the order they appear in the document. The name is separated from the value by
`=' and name/value pairs are separated from each other by `&'.
multipart/form-data
Note. Please consult
[RFC2388] for additional information about file uploads, including backwards compatibility issues, the relationship between "multipart/form-data" and other content types, performance issues, etc.
Please consult the appendix for information about
security issues for forms.
The content type "application/x-www-form-urlencoded" is inefficient for sending large quantities of binary data or text containing non-ASCII characters. The content type "multipart/form-data" should be used for submitting forms that contain files, non-ASCII data, and binary data.
The content "multipart/form-data" follows the rules of all multipart MIME data streams as outlined in
[RFC2045]. The definition of "multipart/form-data" is available at the [IANA] registry.
A "multipart/form-data" message contains a series of parts, each representing a
successful control. The parts are sent to the processing agent in the same order the corresponding controls appear in the document stream. Part boundaries should not occur in any of the data; how this is done lies outside the scope of this specification.
As with all multipart MIME types, each part has an optional "Content-Type" header that defaults to "text/plain". User agents should supply the "Content-Type" header, accompanied by a "charset" parameter.
Each part is expected to contain:
a "Content-Disposition" header whose value is "form-data".
a name attribute specifying the
control name of the corresponding control. Control names originally encoded in non-ASCII character sets may be encoded using the method outlined in [RFC2045].
Thus, for example, for a control named "mycontrol", the corresponding part would be specified:

Content-Disposition: form-data; name="mycontrol"
As with all MIME transmissions, "CR LF" (i.e., `%0D%0A') is used to separate lines of data.
Each part may be encoded and the "Content-Transfer-Encoding" header supplied if the value of that part does not conform to the default (7BIT) encoding (see
[RFC2045], section 6)
If the contents of a file are submitted with a form, the file input should be identified by the appropriate
content type (e.g., "application/octet-stream"). If multiple files are to be returned as the result of a single form entry, they should be returned as "multipart/mixed" embedded within the "multipart/form-data".
The user agent should attempt to supply a file name for each submitted file. The file name may be specified with the "filename" parameter of the 'Content-Disposition: form-data' header, or, in the case of multiple files, in a 'Content-Disposition: file' header of the subpart. If the file name of the client's operating system is not in US-ASCII, the file name might be approximated or encoded using the method of
[RFC2045]. This is convenient for those cases where, for example, the uploaded files might contain references to each other (e.g., a TeX file and its ".sty" auxiliary style description).
The following example illustrates "multipart/form-data" encoding. Suppose we have the following form:

enctype="multipart/form-data"
method="post">


What is your name?

What files are you sending?



If the user enters "Larry" in the text input, and selects the text file "file1.txt", the user agent might send back the following data:
Content-Type: multipart/form-data; boundary=AaB03x

--AaB03x
Content-Disposition: form-data; name="submit-name"

Larry
--AaB03x
Content-Disposition: form-data; name="files"; filename="file1.txt"
Content-Type: text/plain

... contents of file1.txt ...
--AaB03x--

If the user selected a second (image) file "file2.gif", the user agent might construct the parts as follows:
Content-Type: multipart/form-data; boundary=AaB03x

--AaB03x
Content-Disposition: form-data; name="submit-name"

Larry
--AaB03x
Content-Disposition: form-data; name="files"
Content-Type: multipart/mixed; boundary=BbC04y

--BbC04y
Content-Disposition: file; filename="file1.txt"
Content-Type: text/plain

... contents of file1.txt ...
--BbC04y
Content-Disposition: file; filename="file2.gif"
Content-Type: image/gif
Content-Transfer-Encoding: binary

...contents of file2.gif...
--BbC04y--
--AaB03x--