no-https: a plain-HTTP to HTTPS proxy

Ivan Shmakov

2018-09-16 07:07:35 UTC

[Cross-posting to news:comp.misc as the issue of plain-HTTP
unavailability was recently discussed there.]

It took me about a day to write a crude but apparently (more or
less) working HTTP to HTTPS proxy. (I hope to beat it into
shape and release it via news:alt.sources around next Wednesday
or so. FTR, the code is currently under 600 LoC, or 431 LoC
excluding comments and empty lines.) Some design notes are below.

Basics

The basic algorithm is as follows:

1. receive a request header from the client; we only allow
GET and HEAD requests for now, as we do not support request
/bodies/ as of yet;

2. determine the server and connect to it;

3. send the header to the server;

4. receive the response header;

5. if that's an https: redirect:

5.1. connect over TLS, alter the request (Host:, "request target")
accordingly, go to step 3;

6. strip certain headers (such as Strict-Transport-Security: and
Upgrade:, but also Set-Cookie:) off the response and send the
result to the client;

7. copy up to Content-Length: octets from the server to the
client -- or all the remaining data if no Content-Length:
is given; (somewhat surprisingly, this seems to also work with
the "chunked" coding not otherwise considered in the code);

8. close the connection to the server and repeat from step 1
so long as the client connection remains active.
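
Steps 5 to 7 above might look roughly like the sketch below. It is written in Python purely for illustration (the post does not say which language the proxy uses), and every name in it (split_header, https_redirect, retarget, strip_fields, copy_body) is hypothetical. It assumes the header block has already been read in full and ignores obsolete line folding.

```python
from urllib.parse import urlsplit

# Response fields the post says are stripped before forwarding (step 6).
STRIPPED = {b"strict-transport-security", b"upgrade", b"set-cookie"}

def split_header(raw):
    """Split a raw header block into the start line and (name, value) pairs."""
    lines = raw.split(b"\r\n")
    start, fields = lines[0], []
    for line in lines[1:]:
        if not line:          # empty line ends the header block
            break
        name, _, value = line.partition(b":")
        fields.append((name.strip(), value.strip()))
    return start, fields

def https_redirect(start, fields):
    """Return the Location: value of a 3xx redirect to https:, else None (step 5)."""
    parts = start.split(None, 2)
    if len(parts) < 2 or not parts[1].startswith(b"3"):
        return None
    for name, value in fields:
        if name.lower() == b"location" and value.lower().startswith(b"https://"):
            return value
    return None

def retarget(location):
    """Derive (host, port, request target) for the retried request (step 5.1)."""
    u = urlsplit(location.decode("ascii"))
    target = (u.path or "/") + ("?" + u.query if u.query else "")
    return u.hostname, u.port or 443, target

def strip_fields(fields):
    """Drop the fields listed in STRIPPED, keeping the rest in order (step 6)."""
    return [(n, v) for n, v in fields if n.lower() not in STRIPPED]

def copy_body(read, write, length=None, bufsize=65536):
    """Copy `length` octets, or everything up to EOF if length is None (step 7)."""
    remaining = length
    while remaining is None or remaining > 0:
        n = bufsize if remaining is None else min(bufsize, remaining)
        data = read(n)
        if not data:
            break
        write(data)
        if remaining is not None:
            remaining -= len(data)
```

Here read and write are any callables with file-like semantics, such as the read and write methods of socket.makefile() objects.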

The proxy uses select(2) so that socket reads do not block, and
supports an arbitrary number (up to the system-enforced limits)
of concurrent connections. For simplicity, socket writes /are/
allowed to block. (Hopefully not a problem for proxy-to-server
connections most of the time, and even less so for proxy-to-client
ones; assuming no malicious intent on the part of either,
obviously. The latter case may be mitigated by using a "proper"
HTTP proxy, such as Polipo, in front of this one.)
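
To illustrate the select(2) arrangement, here is a minimal Python sketch of one polling step: wait until some sockets are readable, then read once from each of them. The function name read_ready and its parameters are assumptions made for this example, not taken from the actual code.

```python
import select
import socket

def read_ready(socks, timeout=1.0, bufsize=4096):
    """Wait with select() and read once from each readable socket.

    Returns {socket: data}; empty data means the peer closed the
    connection.  Sockets with nothing to read are left out, so the
    recv() calls made here do not block.
    """
    readable, _, _ = select.select(socks, [], [], timeout)
    return {s: s.recv(bufsize) for s in readable}
```

A main loop would call this repeatedly over all client and server sockets, while still writing with the blocking sendall(), which matches the "writes are allowed to block" simplification described above.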

Dealing with the https: references

There was an idea of transparently replacing https: references
in HTML and XML attributes with scheme-relative ones (like, e. g.,
https://example.com/ to //example.com/.) So far, that fails
more often than it works, for two primary reasons: compression
(although that can be solved by forcing Accept-Encoding: identity
in requests) -- and the fact that by the time such filtering can
take place, we've already sent the Content-Length: (if any) for
the original (unaltered) body to the client!
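
As a rough illustration of the rewriting idea and of why it clashes with an already sent Content-Length:, consider the Python sketch below. The pattern and the name make_scheme_relative are assumptions for this example; it only looks at href and src attributes and only at uncompressed bodies.

```python
import re

# Matches the "https:" part of href="https://..." or src='https://...'.
ATTR_HTTPS = re.compile(
    rb"""(\b(?:href|src)\s*=\s*["'])https:(?=//)""", re.IGNORECASE)

def make_scheme_relative(body):
    """Turn https: references in href/src attributes into scheme-relative ones."""
    return ATTR_HTTPS.sub(rb"\1", body)

before = b'<a href="https://example.com/">link</a>'
after = make_scheme_relative(before)
# Each rewritten reference shortens the body by len(b"https:") octets,
# so a Content-Length: computed for the original body no longer fits.
shrink = len(before) - len(after)
```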

Also, as the code does not currently handle the "chunked" coding,
references split across chunks will not be handled. (The code
should handle references split across bufferfuls of data, though.)
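
One way to cope with a reference split across bufferfuls is to hold back any tail of the buffer that could be the start of a match and prepend it to the next bufferful. The Python sketch below shows that technique for a single fixed pattern; it is not necessarily how the actual code does it, the class name StreamRewriter is made up, and it only covers double-quoted attribute values.

```python
class StreamRewriter:
    """Replace a fixed byte string in a stream fed piece by piece."""

    def __init__(self, needle=b'="https://', replacement=b'="//'):
        self.needle = needle
        self.replacement = replacement
        self.carry = b""   # held-back tail from the previous bufferful

    def feed(self, data):
        """Process one bufferful and return the octets that are safe to send."""
        buf = (self.carry + data).replace(self.needle, self.replacement)
        # Hold back the longest tail that is a proper prefix of the needle,
        # since the rest of the match may arrive in the next bufferful.
        keep = 0
        for k in range(min(len(self.needle) - 1, len(buf)), 0, -1):
            if buf.endswith(self.needle[:k]):
                keep = k
                break
        self.carry = buf[len(buf) - keep:]
        return buf[:len(buf) - keep]

    def finish(self):
        """Return whatever is still held back once the body has ended."""
        out, self.carry = self.carry, b""
        return out
```

Usage is feed() for every bufferful received from the server, then finish() once at the end of the body.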

Two possible ways to solve that would be to, for desired
Content-Type: values, either retrieve the whole response in full
before altering and forwarding to the client, /or/ to implement
support for "chunked" coding and force its use there (stripping
Content-Length: off the original response, if any.)
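
For reference, producing the "chunked" coding is simple; a minimal Python sketch (names encode_chunk, LAST_CHUNK and chunked are hypothetical, and no chunk extensions or trailers are emitted) could be:

```python
LAST_CHUNK = b"0\r\n\r\n"   # zero-length chunk plus the empty trailer section

def encode_chunk(data):
    """Frame one piece of body data as a chunk: hex size, CRLF, data, CRLF."""
    return b"%x\r\n%s\r\n" % (len(data), data)

def chunked(pieces):
    """Yield each non-empty piece as a chunk, then the terminating chunk."""
    for piece in pieces:
        if piece:   # an empty piece would look like the end of the body
            yield encode_chunk(piece)
    yield LAST_CHUNK
```

The response header would then carry Transfer-Encoding: chunked in place of Content-Length:, so the altered body may have any length.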

I suppose both approaches can be implemented, with the first
used, say, when Content-Length: is below a configured limit,
although that increases the complexity of the code, which is
something I'd rather avoid.
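
The combined approach boils down to a per-response decision, which might be sketched in Python as follows. The function name, the list of media types, and the 1 MiB default limit are all assumptions chosen for the example.

```python
# Media types assumed to be worth rewriting (HTML and XML, per the above).
REWRITABLE_TYPES = ("text/html", "application/xhtml+xml",
                    "text/xml", "application/xml")

def choose_strategy(content_type, content_length, limit=1 << 20):
    """Pick how to forward a response body.

    "passthrough": not a rewritable type, copy the body unaltered;
    "buffer":      small enough, read in full and recompute Content-Length:;
    "chunked":     rewrite on the fly and forward with the "chunked" coding.
    """
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    if media_type not in REWRITABLE_TYPES:
        return "passthrough"
    if content_length is not None and content_length < limit:
        return "buffer"
    return "chunked"
```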

That said, I don't think the https: references /should/ be an
issue in practice, as most links ought to be relative
in the first place, such as:

<p ><a href="page2.html" >Continue reading of this article</a>,
or <a href="/" >go back to the top page.</a></p>

However, I suspect that images and such may be a common
exception in practice, like:

<img src="https://static.example.com/useless-stock-photo.jpeg" />

Which of course would've worked just as well (and required no
specific action on the part of this proxy) had it been written as:

<img src="//static.example.com/useless-stock-photo.jpeg" />

Making responses even better

Other possible response alterations may include removing <link />
elements and Link: HTTP headers pointing to JavaScript code
(running arbitrary software from the Web is a bad idea, and
doing so while forgoing the meager TLS protection isn't making
it better) /and/ also <script /> elements. The latter, in turn,
will probably either require rather complex state tracking --
or getting the server response in full before the alterations
can take place.
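
For the "get the response in full first" variant, removing <script /> elements from a buffered body can be approximated with a regular expression, as in the crude Python sketch below. This is an illustration only: the name strip_scripts is made up, and a pattern like this will be fooled by, for example, a literal "</script>" inside a string or comment, which is part of why proper state tracking is the harder but more robust route.

```python
import re

# Either an XML-style empty element, <script ... />, or a start tag
# followed by everything up to the nearest </script> end tag.
SCRIPT = re.compile(
    rb"<script\b[^>]*/>|<script\b[^>]*>.*?</script\s*>",
    re.IGNORECASE | re.DOTALL)

def strip_scripts(body):
    """Remove script elements from a fully buffered HTML or XHTML body."""
    return SCRIPT.sub(b"", body)
```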

Thoughts?

--
FSF associate member #7257 np. Nine Lives -- Slaygon

Eli the Bearded

2018-09-16 20:52:00 UTC