Handling multipart/form-data natively in Python

  • A+
所属分类:Python

RFC7578 (who obsoletes RFC2388) defines the multipart/form-datatype that is usually transported over HTTP when users submit forms on your Web page. Nowadays, it tends to be replaced by JSON encoded payloads; nevertheless, it is still widely used.

While you could decode an HTTP body request made with JSON natively with Python — thanks to the json module — there is no such way to do that with multipart/form-data. That's something barely understandable considering how old the format is.

There is a wide variety of way available to encode and decode this format. Libraries such as requests support this natively without making you notice, and the same goes for the majority of Web server frameworks such as Django or Flask.

However, in certain circumstances, you might be on your own to encode or decode this format, and it might not be an option to pull (significant) dependencies.

Encoding

The multipart/form-data format is quite simple to understand and can be summarised as an easy way to encode a list of keys and values, i.e., a portable way of serializing a dictionary.

There's nothing in Python to generate such an encoding. The format is quite simple and consists of the key and value surrounded by a random boundary delimiter. This delimiter must be passed as part of the Content-Type, so that the decoder can decode the form data.

There's a simple implementation in urllib3 that does the job. It's possible to summarize it in this simple implementation:

import binasciiimport osdef encode_multipart_formdata(fields):
    boundary = binascii.hexlify(os.urandom(16)).decode('ascii')

    body = (
        "".join("--%s\r\n"
                "Content-Disposition: form-data; name=\"%s\"\r\n"
                "\r\n"
                "%s\r\n" % (boundary, field, value)
                for field, value in fields.items()) +
        "--%s--\r\n" % boundary    )

    content_type = "multipart/form-data; boundary=%s" % boundary    return body, content_type

You can use by passing a dictionary where keys and values are bytes. For example:

encode_multipart_formdata({"foo": "bar", "name": "jd"})

Which returns:

--00252461d3ab8ff5c25834e0bffd6f70
Content-Disposition: form-data; name="foo"

bar
--00252461d3ab8ff5c25834e0bffd6f70
Content-Disposition: form-data; name="name"

jd
--00252461d3ab8ff5c25834e0bffd6f70--
multipart/form-data; boundary=00252461d3ab8ff5c25834e0bffd6f70

You can use the returned content type in your HTTP reply header Content-Type. Note that this format is used for forms: it can also be used by emails.

Emails did you say?

Encoding with email

Right, emails are usually encoded using MIME, which is defined by yet another RFC, RFC2046. It turns out that multipart/form-data is just a particular MIME format, and that if you have code that implements MIME handling, it's easy to use it to implement this format.

Fortunately for us, Python standard library comes with a module that handles exactly that: email.mime. I told you it was heavily used by email — I guess that's why they put that code in the email subpackage.

Here's a piece of code that handles multipart/form-data in a few lines of code:

from email import messagefrom email.mime import multipartfrom email.mime import nonmultipartfrom email.mime import textclass MIMEFormdata(nonmultipart.MIMENonMultipart):
    def __init__(self, keyname, *args, **kwargs):
        super(MIMEFormdata, self).__init__(*args, **kwargs)
        self.add_header(
            "Content-Disposition", "form-data; name=\"%s\"" % keyname)def encode_multipart_formdata(fields):
    m = multipart.MIMEMultipart("form-data")

    for field, value in fields.items():
        data = MIMEFormdata(field, "text", "plain")
        data.set_payload(value)
        m.attach(data)

    return m

Using this piece of code returns the following:

ontent-Type: multipart/form-data; boundary="===============1107021068307284864=="
MIME-Version: 1.0

--===============1107021068307284864==
Content-Type: text/plain
MIME-Version: 1.0
Content-Disposition: form-data; name="foo"

bar
--===============1107021068307284864==
Content-Type: text/plain
MIME-Version: 1.0
Content-Disposition: form-data; name="name"

jd
--===============1107021068307284864==--

This method has several advantages over our first implementation:

  • It handles Content-Type for each of the added MIME parts. We could add other data types than just text/plain like it is implicitly done in the first version. We could also specify the charset (encoding) of the textual data.

  • It's very likely more robust by leveraging the wildly tested Python standard library.

The main downside, in that case, is that the Content-Type header is included with the content. In case of handling HTTP, it is problematic as this needs to be sent as part of the HTTP header and not as part of the payload.

It should be possible to build a particular generator from email.generator that does this. I'll leave that as an exercise to you, reader.

Decoding

We must be able to use that same email package to decode our encoded data, right? It turns out that's the case, with a piece of code that looks like this:

import email.parser


msg = email.parser.BytesParser().parsebytes(my_multipart_data)print({
    part.get_param('name', header='content-disposition'): part.get_payload(decode=True)
    for part in msg.get_payload()})

With the example data above, this returns:

{'foo': b'bar', 'name': b'jd'}

Amazing, right?

The moral of this story is that you should never underestimate the power of the standard library. While it's easy to add a single line in your list of dependencies, it's not always required if you dig a bit into what Python provides for you!

  • 我的微信
  • 这是我的微信扫一扫
  • weinxin
  • 我的微信公众号
  • 我的微信公众号扫一扫
  • weinxin

发表评论

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: