
Key: CONF-10140
Type: Bug
Status: Resolved
Resolution: Fixed
Priority: Major
Assignee: Christopher Owen [Atlassian]
Reporter: Neeraj Jhanji [Atlassian]
Votes: 1
Watchers: 2
Confluence

XML-RPC does not handle Japanese characters in page title and content

Created: 04/Dec/07 09:29 PM   Updated: 13/Jan/08 07:11 PM
Component/s: Remote API (SOAP & XML-RPC)
Affects Version/s: 2.6.0
Fix Version/s: 2.7.1

Time Tracking:
Not Specified

Environment: all
Issue Links:
Reference
 

Participants: Christopher Owen [Atlassian] and Neeraj Jhanji [Atlassian]
Since last comment: 44 weeks, 4 days ago
Resolution Date: 13/Jan/08 07:11 PM
Labels:


Description
If you try to update a page over XML-RPC with Japanese text in the title or content, the characters get corrupted.

See the example Ruby script below. A Python script did not work either.

The Japanese characters are UTF-8 encoded.

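require 'xmlrpc/client'

# Host, path and port of the Confluence XML-RPC endpoint (port given as an integer)
server = XMLRPC::Client.new("ZZZ.canon.co.jp", "/rpc/xmlrpc", 8080)

# Log in; the returned token authenticates the following calls
token = server.call("confluence1.login", "XXX", "YYY")

# Page whose content contains UTF-8 encoded Japanese text
newpage = {"title" => "NewPage",
   "content" => "h1. new content\n日本語\nh2. new content2\n",
   "space" => "test"}
server.call("confluence1.storePage", token, newpage)

server.call("confluence1.logout", token)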


Comments
Christopher Owen [Atlassian] added a comment - 05/Dec/07 12:15 AM
The cause of this seems to be that the Apache XML-RPC library uses the platform default encoding to interpret incoming RPC requests, rather than standard XML mechanisms. Unless your default platform encoding matches the encoding of strings sent in the RPC call, internationalised characters are going to be corrupted.

I'll investigate how we can prevent this stupidity.
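
To illustrate the kind of corruption described in this comment, here is a minimal Ruby sketch that decodes the same UTF-8 bytes with two different charsets. The decode_as helper and the choice of ISO-8859-1 as the mismatched platform default are assumptions made for illustration; none of this is taken from the Apache XML-RPC code.

```ruby
# Hypothetical helper: interpret raw request bytes as `charset`,
# then convert the result to UTF-8 for display.
def decode_as(bytes, charset)
  bytes.dup.force_encoding(charset).encode("UTF-8")
end

# The bytes a client sends for the text 日本語 when it encodes as UTF-8.
utf8_bytes = "\u65E5\u672C\u8A9E".b

# A server whose default charset is ISO-8859-1 (assumed here) turns
# every byte into a separate character, so the text is corrupted.
wrong = decode_as(utf8_bytes, "ISO-8859-1")

# A server that decodes with the charset the client actually used
# recovers the original three characters.
right = decode_as(utf8_bytes, "UTF-8")

puts wrong.length  # 9
puts right.length  # 3
```

The bytes on the wire are identical in both cases; only the charset chosen by the receiver differs.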


Neeraj Jhanji [Atlassian] added a comment - 05/Dec/07 12:52 AM
Is there a simple way to change the platform default encoding? By platform default, do you mean the OS?

Christopher Owen [Atlassian] added a comment - 05/Dec/07 01:06 AM
I believe you can set the system property file.encoding to the character encoding you want to use. I haven't 100% verified what's going on yet but changing it to UTF-8 might get the request to complete properly. I can't make guarantees that other bad things won't happen if you change the default; it would be wise to test first.

Neeraj Jhanji [Atlassian] added a comment - 05/Dec/07 04:05 AM
Upon further investigation, it seems XML-RPC handles Japanese characters fine if you set LANG to UTF-8. Perhaps we should mention this somewhere in the startup.sh file?

Christopher Owen [Atlassian] added a comment - 05/Dec/07 04:24 AM

"Upon further investigation, it seems XML-RPC handles Japanese characters fine if you set LANG to UTF-8"

That's because the VM uses a mix of locale and system properties to determine the platform default charset on startup. Unfortunately this doesn't really solve the problem - it only works in this case because your VM happens to be running with a default charset that is the same as the one the client happens to be sending the request in.

The sanest way to do this is to parse the request using the normal rules for determining the encoding of an XML document: either use the encoding specified explicitly in the request or, in its absence, follow the rules of XML document encoding determination. (Without a byte order mark or explicit encoding, the XML should be decoded as UTF-8.)
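
The rules mentioned above can be sketched roughly as follows in Ruby. This is a simplified illustration, not a full implementation of the XML specification's detection procedure: the function name detect_xml_encoding is hypothetical, and the sketch ignores UTF-32, EBCDIC and UTF-16 documents that lack a byte order mark.

```ruby
# Simplified sketch: choose the charset for a raw XML request body.
# Order of checks: byte order mark, then the encoding pseudo-attribute
# of the XML declaration, then the UTF-8 default.
def detect_xml_encoding(bytes)
  b = bytes.b  # work on a binary copy so no charset is assumed yet

  # 1. A byte order mark identifies the encoding directly.
  return "UTF-8"    if b.start_with?("\xEF\xBB\xBF".b)
  return "UTF-16BE" if b.start_with?("\xFE\xFF".b)
  return "UTF-16LE" if b.start_with?("\xFF\xFE".b)

  # 2. Otherwise look for encoding="..." in the XML declaration.
  decl = /\A<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']/
  if (m = b.match(decl))
    return m[1].force_encoding("US-ASCII")
  end

  # 3. No BOM and no declared encoding: decode as UTF-8.
  "UTF-8"
end

body = '<?xml version="1.0"?><methodCall/>'.b
puts detect_xml_encoding(body)  # UTF-8
```

The point of the sketch is that the charset comes from the request itself and never from the receiving machine's locale.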


Christopher Owen [Atlassian] added a comment - 05/Dec/07 03:56 PM
Some more specific information from the investigation: it appears that this behaviour is related to the SAX parser used by default in the Apache XML-RPC server (MinML). Looking through the source of this parser shows that request input streams are used directly with an InputStreamReader and no charset is specified. InputStreamReader will use the platform default in this case.

We should investigate configuring XML-RPC to use a more capable (read: correct) SAX parser.
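
Ruby has a rough analogue of reading a stream without naming a charset, which may help picture the behaviour described for InputStreamReader. In this sketch, File.read without an encoding argument tags the data with Encoding.default_external, which Ruby derives from the locale at startup, while the second call states the charset explicitly. The read_both_ways helper is hypothetical, and this is only an analogy, not the Java code path.

```ruby
require "tempfile"

# Hypothetical helper: write raw bytes to a temporary file, then read
# them back once with the implicit default charset and once with an
# explicit one. Returns [implicit, explicit].
def read_both_ways(bytes)
  Tempfile.create("request") do |f|
    f.binmode
    f.write(bytes)
    f.flush
    implicit = File.read(f.path)                     # uses Encoding.default_external
    explicit = File.read(f.path, encoding: "UTF-8")  # charset chosen by the caller
    [implicit, explicit]
  end
end

implicit, explicit = read_both_ways("\u65E5\u672C\u8A9E".b)
puts implicit.encoding  # whatever the platform default happens to be
puts explicit.encoding  # UTF-8
```

The implicit read only gives the right answer when the platform default happens to match what the sender used.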