xmlrpc and charset=utf-8
Posted by Phil Tomson (Guest)
on 18.04.2006 01:44
I needed to interact with an XMLRPC server written using the
xmlrpc-c library for C/C++.  I was using Ruby 1.8.4 and found that I
could not get a simple xmlrpc client written in Ruby to
communicate with the xmlrpc-c server.

I kept getting the following error:
  /usr/local/lib/ruby/1.8/xmlrpc/client.rb:547:in `do_rpc': HTTP-Error: 400 Bad Request   (RuntimeError)
        from /usr/local/lib/ruby/1.8/xmlrpc/client.rb:420:in `call2'
        from /usr/local/lib/ruby/1.8/xmlrpc/client.rb:410:in `call'
        from littleclient.rb:7


I tried downgrading to Ruby 1.8.2 and it worked fine.

When I investigated the difference I found the following in the
xmlrpc/client.rb file that comes with Ruby 1.8.4:
    def do_rpc(request, async=false)
      header = {
       "User-Agent"     =>  USER_AGENT,
       "Content-Type"   => "text/xml; charset=utf-8",
       "Content-Length" => request.size.to_s,
       "Connection"     => (async ? "close" : "keep-alive")
      }

This differs from the client.rb included with Ruby 1.8.2:

      def do_rpc(request, async=false)
        header = {
         "User-Agent"     =>  USER_AGENT,
         "Content-Type"   => "text/xml ",
         "Content-Length" => request.size.to_s,
         "Connection"     => (async ? "close" : "keep-alive")
        }

So I changed the code in the 1.8.4 version of client.rb to remove the
"charset=utf-8" parameter.  After that, the Ruby client interacted fine
with the xmlrpc-c server.

I'm wondering if utf-8 should be the default charset for Ruby's xmlrpc
client implementation?  Also, I'm wondering if perhaps it could be made
selectable by adding an accessor method to the Client class?
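
A minimal sketch of the accessor idea, for illustration only. The class
name RpcHeaders, the constant DEFAULT_CONTENT_TYPE and the "example-client"
user agent are assumptions made up for this example; none of them are part
of Ruby's xmlrpc library. It builds the same four headers that do_rpc sets,
but lets the caller override the Content-Type. It also uses bytesize for
Content-Length, since on Ruby 1.9 and later String#size counts characters
rather than bytes.

```ruby
# Hypothetical helper, NOT part of Ruby's xmlrpc library.
class RpcHeaders
  DEFAULT_CONTENT_TYPE = "text/xml; charset=utf-8"

  # Callers may overwrite this, e.g. with plain "text/xml".
  attr_accessor :content_type

  def initialize(user_agent)
    @user_agent   = user_agent
    @content_type = DEFAULT_CONTENT_TYPE
  end

  # Builds the same four headers that do_rpc puts in its header hash.
  def build(request, async = false)
    {
      "User-Agent"     => @user_agent,
      "Content-Type"   => @content_type,
      # bytesize counts bytes, which is what Content-Length needs.
      "Content-Length" => request.bytesize.to_s,
      "Connection"     => (async ? "close" : "keep-alive")
    }
  end
end

headers = RpcHeaders.new("example-client")
headers.content_type = "text/xml"   # drop the charset for a picky server
p headers.build("<methodCall/>")
```

With something like this, a client talking to a server that rejects the
charset parameter could switch it off without editing client.rb.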

Phil
Re: xmlrpc and charset=utf-8
Posted by Daniel Berger (Guest)
on 18.06.2006 05:01
--- Phil Tomson <rubyfan@gmail.com> wrote:

> I'm wondering if utf-8 should be the default charset
> for Ruby's xmlrpc
> client implementation?  Also, I'm wondering if
> perhaps it could be
> selectable by adding an accessor method to the
> client to the Client
> class?
> 
> Phil

Was this ever addressed?  I vote for both a default of
utf8 and an accessor method.

Regards,

Dan
Re: xmlrpc and charset=utf-8
Posted by Jesse Clark (jesse-c)
on 18.09.2007 03:03
It doesn't seem that anything ever became of this. I would like to re-open
the topic for discussion with another vote for defaulting the Content-Type
header to "text/xml; charset=utf-8" but adding an accessor so this value can
be overridden.

My specific need comes from trying to interface with weblog software via the
MetaWeblog API. Some blog packages incorrectly throw invalid content-type
faults because they don't recognize the charset parameter.

Currently I have overridden do_rpc to set "Content-Type"   => "text/xml" but
this seems less than ideal.

-Jesse
Re: xmlrpc and charset=utf-8
Posted by Martin Duerst (Guest)
on 18.09.2007 12:10
At 10:02 07/09/18, jesse_c wrote:
>>> > | In this case, MIME and XML processors MUST assume the charset is
>>> > | "us-ascii"
>>>
>>> This is interesting.  It seems to be at odds with the XML specification,
>>> which
>>> says:

It seems to be at odds, but it's not.

>>>         http://www.w3.org/TR/2006/PER-xml-20060614/#charencoding
>>>
>>>         > In the absence of information provided by an external transport
>>> protocol

The external protocol provides a MIME type of text/xml, which as
defined defaults to US-ASCII. Therefore, there is external information.

>>>         > ordinary ASCII entities do not strictly need an encoding
>>> declaration.
>>>
>>> I read this to say that XML documents, in the absence of both external
>>> encoding information or an XML declaration, must be assumed to be UTF-8.
>>> RFC3023 appears to be saying that XML documents default to US-ASCII.

Yes, if they come served with a MIME type of text/xml (without charset
parameter), because that's part of the definition of text/xml. Absence of an
explicit "us-ascii" label and absence of information are not the same. That
all may sound a bit far-fetched, but that's how things are defined in the
specs, sorry.

>It doesn't seem that anything ever became of this. I would like to re-open
>the topic for discussion with another vote for defaulting the Content-Type
>header to "text/xml; charset=utf-8" but adding an accessor so this value can
>be overridden. 

Adding an accessor is definitely a very good idea. Another idea is to
change the default to "application/xml". "application/xml" does NOT
imply US-ASCII, but (unless it comes with a charset parameter) means
'look at the XML document itself' (which in case of no BOM and no
encoding declaration means UTF-8).
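
The distinction between the two media types can be sketched roughly as
follows. This is only an illustration of the RFC 3023 rules discussed in
this thread; the method name charset_from_content_type is made up for the
example, and it handles only text/xml and application/xml. It returns the
charset implied by the header, or nil when the header says nothing and the
XML document itself has to be examined.

```ruby
# Hypothetical helper illustrating the RFC 3023 defaults discussed here.
# Returns the charset implied by a Content-Type header value, or nil when
# the XML document itself must be inspected.
def charset_from_content_type(header)
  media_type, *params = header.split(";").map(&:strip)

  # An explicit charset parameter always wins.
  params.each do |param|
    name, value = param.split("=", 2)
    next unless name && value && name.strip.casecmp("charset").zero?
    return value.strip.delete('"').downcase
  end

  case media_type.to_s.downcase
  when "text/xml" then "us-ascii"  # text/xml without charset means US-ASCII
  else nil                         # e.g. application/xml: look at the document
  end
end

p charset_from_content_type("text/xml; charset=utf-8")  # => "utf-8"
p charset_from_content_type("text/xml")                 # => "us-ascii"
p charset_from_content_type("application/xml")          # => nil
```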

>http://www.nabble.com/xmlrpc-and-charset%3Dutf-8-tf1465065.html#a12748102
>Sent from the ruby-core mailing list archive at Nabble.com.

Oh, great. That one is much easier to use than the 'default' one at
blade.nagaokaut.ac.jp.

By the way, there is now an official way to include pointers such as
the above into a mail header. Please see
http://www.ietf.org/internet-drafts/draft-duerst-archived-at-09.txt.

Regards,    Martin.


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
Re: xmlrpc and charset=utf-8
Posted by Sean E. Russell (Guest)
on 19.06.2006 13:11
On Saturday 17 June 2006 23:00, Daniel Berger wrote:
> > I'm wondering if utf-8 should be the default charset
> > for Ruby's xmlrpc client implementation?
...
> Was this ever addressed?  I vote for both a default of
> utf8 and an accessor method.

Well... FWIW, XML documents are, unless otherwise specified by an XML
declaration, UTF8.  The HTTP header should reflect the encoding of the
payload.

--
--- SER

"As democracy is perfected, the office of president represents,
more and more closely, the inner soul of the people.  On some
great and glorious day the plain folks of the land will reach
their heart's desire at last and the White House will be adorned
by a downright moron."        -  H.L. Mencken (1880 - 1956)
Re: xmlrpc and charset=utf-8
Posted by Kazuhiro NISHIYAMA (Guest)
on 19.06.2006 19:38
>>>>> On Sun, 18 Jun 2006 12:00:19 +0900
>>>>> djberg96@yahoo.com(Daniel Berger)  said:
> 
> Was this ever addressed?  I vote for both a default of
> utf8 and an accessor method.

http://www.zvon.org/tmRFC/RFC3023/Output/chapter8.html#sub5
| This example shows text/xml with the charset parameter omitted.
| In this case, MIME and XML processors MUST assume the charset is "us-ascii"
This is the reason for charset=utf-8.

The reason there is no accessor method is that encoding conversions
depend on the platform. (see ext/iconv/charset_alias.rb)
Re: xmlrpc and charset=utf-8
Posted by Sean Russell (Guest)
on 19.06.2006 22:39
I first sent this from the wrong email account, so if that post somehow makes
its way onto the list, then please forgive the repetition.

On Monday 19 June 2006 13:35, Kazuhiro NISHIYAMA wrote:
> > Was this ever addressed?  I vote for both a default of
> > utf8 and an accessor method.
>
> http://www.zvon.org/tmRFC/RFC3023/Output/chapter8.html#sub5
>
> | This example shows text/xml with the charset parameter omitted.
> | In this case, MIME and XML processors MUST assume the charset is
> | "us-ascii"

This is interesting.  It seems to be at odds with the XML specification,
which says:

	http://www.w3.org/TR/2006/PER-xml-20060614/#charencoding

	> In the absence of information provided by an external transport
	> protocol (e.g. HTTP or MIME), it is a fatal error for an entity
	> including an encoding declaration to be presented to the XML
	> processor in an encoding other than that named in the declaration,
	> or for an entity which begins with neither a Byte Order Mark nor an
	> encoding declaration to use an encoding other than UTF-8. Note that
	> since ASCII is a subset of UTF-8, ordinary ASCII entities do not
	> strictly need an encoding declaration.

I read this to say that XML documents, in the absence of both external
encoding information or an XML declaration, must be assumed to be UTF-8.
RFC3023 appears to be saying that XML documents default to US-ASCII.

Now, granted, RFC3023 is a transport protocol, and they're basically saying
that if you don't specify the encoding then assume that the content is
US-ASCII.  However, I find it strange that they specifically require XML
processors to assume that unannotated documents are ASCII encoded, which is
in opposition to the XML spec.

In any case, it appears that the Ruby XML-RPC library is handling data
correctly, while the C library is not (since it appears to be ignoring the
HTTP header encoding information).

--- SER

Re: xmlrpc and charset=utf-8
Posted by Dominique Brezinski (Guest)
on 19.06.2006 23:11
On 6/19/06, Sean Russell <ser@germane-software.com> wrote:
> I read this to say that XML documents, in the absence of both external
> encoding information or an XML declaration, must be assumed to be UTF-8.
> RFC3023 appears to be saying that XML documents default to US-ASCII.

You are correct in your interpretation of the XML spec, and I agree
that the XMLRPC C library mentioned appears to be the flawed
implementation. The XML spec reads:

Although an XML processor is required to read only entities in the
UTF-8 and UTF-16 encodings, it is recognized that other encodings are
used around the world, and it may be desired for XML processors to
read entities that use them. In the absence of external character
encoding information (such as MIME headers), parsed entities which are
stored in an encoding other than UTF-8 or UTF-16 MUST begin with a
text declaration (see 4.3.1 The Text Declaration) containing an
encoding declaration....

And RFC 3023 states that the charset parameter of the text/xml
registration is strongly recommended. The following description of the
charset parameter is straight from RFC 3023:

Although listed as an optional parameter, the use of the charset
parameter is STRONGLY RECOMMENDED, since this information can be
used by XML processors to determine authoritatively the character
encoding of the XML MIME entity.  The charset parameter can also
be used to provide protocol-specific operations, such as charset-
based content negotiation in HTTP.  "utf-8" [RFC2279] is the
recommended value, representing the UTF-8 charset.  UTF-8 is
supported by all conforming processors of [XML].

Cheers,
Dom
Re: xmlrpc and charset=utf-8
Posted by Sean E. Russell (Guest)
on 27.09.2007 03:19
On Tuesday 18 September 2007, Martin Duerst wrote:
> >>> I read this to say that XML documents, in the absence of both external
> >>> encoding information or an XML declaration, must be assumed to be
> >>> UTF-8. RFC3023 appears to be saying that XML documents default to
> >>> US-ASCII.
>
> Yes, if they come served with a MIME type of text/xml (without charset
> parameter), because that's part of the definition of text/xml. Absence of
> an explicit "us-ascii" label and absence of information are not the same.
> That all may sound a bit far-fetched, but that's how things are defined in
> the specs, sorry.

If the external transport specifies the encoding, then it is up to the code
that is processing the transportation to set the encoding of the XML document
via the API.  The XML parser can't know anything about the transport.  That
is to say, it is *still* not the parser's responsibility to guess that the
encoding is anything other than UTF-8; it must be told otherwise.

Put another way, the code accepting the content must tell the parser what
encoding the stream is using, if it is using anything other than UTF-8.

> Adding an accessor is definitely a very good idea. Another idea is to
> change the default to "application/xml". "application/xml" does NOT
> imply US-ASCII, but (unless it comes with a charset parameter) means
> 'look at the XML document itself' (which in case of no BOM and no
> encoding declaration means UTF-8).

Another option is to have XMLRPC explicitly set the encoding to whatever the
transport says it is.  Of course, this would require that XMLRPC parse the
first line of the file and make sure that the encoding isn't already
specified in the document itself, but that isn't too difficult.
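
A rough sketch of that check, assuming a hypothetical helper named
declared_encoding (it is not an existing XMLRPC or REXML method). It looks
for a Byte Order Mark first and then for an encoding pseudo-attribute in
the XML declaration, and returns nil when the document specifies nothing.
It does not try to handle UTF-16 documents that lack a BOM.

```ruby
# Hypothetical helper: returns the encoding the document itself announces
# (via BOM or XML declaration), or nil if it announces none.
def declared_encoding(xml)
  # Work on raw bytes so the check does not depend on the string's encoding.
  bytes = xml.dup.force_encoding(Encoding::BINARY)

  return "UTF-8"  if bytes.start_with?("\xEF\xBB\xBF".b)
  return "UTF-16" if bytes.start_with?("\xFE\xFF".b, "\xFF\xFE".b)

  # The XML declaration, if present, must be the very first thing in the file.
  decl = bytes[/\A<\?xml[^>]*\?>/]
  decl && decl[/encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']/, 1]
end

transport_charset = "iso-8859-1"   # e.g. taken from the HTTP header
body = '<?xml version="1.0"?><methodResponse/>'

# Prefer what the document says, then the transport, then the XML default.
encoding = declared_encoding(body) || transport_charset || "UTF-8"
p encoding
```

Whether the document or the transport should win when both are present is
exactly the question debated above; the order used here is just one choice.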
Re: xmlrpc and charset=utf-8
Posted by Jesse Clark (jesse-c)
on 27.09.2007 19:04
Sean E. Russell wrote:
> On Tuesday 18 September 2007, Martin Duerst wrote:
<snipped the charset and specifications discussion>
> specified in the document itself, but that isn't too difficult.

RFC 3023 states: "If an XML document -- that is, the unprocessed, source
XML document -- is readable by casual users, text/xml is preferable to
application/xml" and goes on to suggest that user agents which do not
support text/xml can display it as text/plain. "readable by casual
users" seems a little vague to me but it does seem that this paragraph
implies that the choice of MIME type should be based on the structure of
the xml document being transported.

To me this would support the case for choosing a reasonable default
(perhaps by parsing the XML declaration, checking for a BOM, and falling
back to text/xml?) and then also providing an accessor so the user can
choose to override the content-type and charset.