Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Binary attributes in parseFilter get serialized incorrectly in python3 #160

Open
xdesai opened this issue Feb 17, 2020 · 1 comment
Open

Comments

@xdesai
Copy link
Contributor

xdesai commented Feb 17, 2020

Hi all,
We are experiencing an issue when trying to parse an LDAP filter string that contains a binary attribute such as objectguid.

Using python3.7 and the latest ldaptor here is a snippet of code that can produce the error we're seeing.

from ldaptor import ldapfilter
filter_text = '(objectguid=\\be)'
res = ldapfilter.parseFilter(filter_text)
print(repr(res))
print(res.toWire())

Output:

LDAPFilter_equalityMatch(attributeDesc=LDAPAttributeDescription(value='objectguid'), assertionValue=LDAPAssertionValue(value='¾'))
b'\xa3\x10\x04\nobjectguid\x04\x02\xc2\xbe'

The problem is a bit difficult to notice at first, but it's specifically found at the end of that message.
\\be became \xc2\xbe instead of just \xbe. This extra byte is due to the utf-8 encoding. But before we discuss that further let's step back for a second.

As far as we can tell, the way filter parsing works is that it builds up a list of characters in the filter string one at a time. Each byte is run through the _p_escaped function in ldapfilter.py which turns the hex byte into a character. Then it stores this collection of characters as a python3 string on the LDAPFilter object.

The problem manifests itself when we call the toWire() method on the ldapfilter. Each string in the filter has the to_bytes() method from _encoder.py called on it. In this situation to_bytes(value) ends up being value.encode('utf-8'). For a binary data blob this encoding is not the appropriate choice and it actually corrupts the data.

utf-8 was designed to use just 1 byte when encoding anything in the range of 0-128. However in the range 128-256 it requires 2 bytes to encode a single character. Which is why a value like \xbe (190 in decimal) can only be stored as \xc2\xbe. Storing a literal ‘\xbe` would be invalid for the encoding.

While utf-8 is actually the correct encoding for most LDAP fields this does not work for fields that contain binary data like objectguid. The extra bytes added by the utf-8 encoding essentially corrupt the field we are trying to search on and the LDAP Server will return 0 results for your search.

We have some egregious hacks in place today to make it so that objectguid filters use a latin-1 encoding. This encoding uses only 1 byte for the range 0-256 so the toWire() method returns data that the LDAPServer likes. However, we hate this and would love to see a broader and better solution adopted by ldaptor. Unfortunately we were not able to identify a proper solution ourselves. The main difficulty we ran up against when trying to develop a real solution was that at the time of encoding for the wire the LDAPAttributeValueAssertion /BerSequence has no context and we are not able to tell whether this data is binary or not. So in lieu of a PR I wanted to create this issue so that the ldaptor team was at least aware and maybe a real solution could be found.

Thanks!
Xander

@GrayAn
Copy link
Collaborator

GrayAn commented Feb 21, 2020

Hi @xdesai ,

Thank you for the issue and sorry for the long answer.

Most ldaptor classes can receive byte strings and unicode strings as arguments. It is encouraged to use byte strings as in that case these values are sent via network "as is". Unicode strings are converted to bytes before sending (see toWire method).

However ldapfilter functions need unicode strings in order to parse filter strings with the pyparsing library. If you pass byte string it is decoded to unicode and then encoded back in the toWire method as UTF-8 byte string.

(Back in Python 2 times ldaptor did not need these conversions because attribute values and transmitted values were of the same type - Python 2 str or bytes. Even if you passed unicode in most cases it was implicitly encoded to bytes. Encoding/decoding issues started when ldaptor was ported to Python 3.)

Right now I cannot find a good solution. I thought of rewriting ldapfilter module to allow using byte strings without decoding but it may be a too difficult task. Perhaps encoding may be passed as an optional argument but it looks like a large change: all ldaptor classes with toWire method will receive this argument in this case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants