Unicode? Is that like some kind of cyber unicorn?
onarliog: Hehe, good times, reminds me of the first time I saw this back in 2013.

https://labs.spotify.com/2013/06/18/creative-usernames/
That's just a regular bug that has nothing to do with unicode. They didn't have to demand X=canonical(X), they had to check existence of canonical(X) when registering a user, aka not be fucking idiots.
I can't post links, but what you are looking for is called "unicode confusables". It is a lot easier once you know their "official" name.
Search for this term and you can find a list of unicode characters that look alike.
Gede: I can't post links, but what you are looking for is called "unicode confusables". It is a lot easier once you know their "official" name.
Search for this term and you can find a list of unicode characters that look alike.
I think if you put a link in a quote, you can still post it.
Might be possible to calculate the edit distance between two usernames with customised values for specific character substitutions.
I don't think it's an optimal solution though.
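For what it's worth, here is a rough sketch of that idea: a plain Levenshtein distance where substituting one look-alike for another costs less than a normal substitution. The cost table and the names `CHEAP_SUBSTITUTIONS`, `substitution_cost` and `weighted_distance` are made up for illustration; the weights are guesses, not measured values.

```python
# Hypothetical costs for look-alike pairs; any pair not listed costs 1.0.
CHEAP_SUBSTITUTIONS = {
    frozenset("lI"): 0.1,
    frozenset("l1"): 0.1,
    frozenset("I1"): 0.1,
    frozenset("O0"): 0.1,
    frozenset("nm"): 0.5,
}


def substitution_cost(a: str, b: str) -> float:
    """Cost of replacing character a with character b."""
    if a == b:
        return 0.0
    return CHEAP_SUBSTITUTIONS.get(frozenset((a, b)), 1.0)


def weighted_distance(s: str, t: str) -> float:
    """Levenshtein distance with custom substitution costs."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        cur = [float(i)]
        for j, b in enumerate(t, start=1):
            cur.append(min(
                prev[j] + 1.0,                           # delete a
                cur[j - 1] + 1.0,                        # insert b
                prev[j - 1] + substitution_cost(a, b),   # substitute a -> b
            ))
        prev = cur
    return prev[-1]
```

A registration check could then reject a new name whose distance to an existing name falls below some threshold. Comparing against every existing name is expensive, which is one reason this is probably not optimal.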
onarliog: Hehe, good times, reminds me of the first time I saw this back in 2013.

https://labs.spotify.com/2013/06/18/creative-usernames/
Starmaker: That's just a regular bug that has nothing to do with unicode. They didn't have to demand X=canonical(X), they had to check existence of canonical(X) when registering a user, aka not be fucking idiots.
You are entirely ignoring the problem of homoglyphs there.
Gede: I can't post links, but what you are looking for is called "unicode confusables". It is a lot easier once you know their "official" name.
neat, didn't know about that
there is even an official standard about it
https://www.unicode.org/reports/tr39/#Confusable_Detection

the widely used ICU library seems to provide a nice API for it:
Unicode Security and Spoofing Detection, C API (uspoof.h).
Starmaker: That's just a regular bug that has nothing to do with unicode. They didn't have to demand X=canonical(X), they had to check existence of canonical(X) when registering a user, aka not be fucking idiots.
onarliog: You are entirely ignoring the problem of homoglyphs there.
I googled homoglyphs and saw this.
...
Post edited June 27, 2022 by wbmatic
hmmm sounds like what they would do for spellchecking and phonetic spelling, and maybe a little combination of leet-speak.

1, lowercase L and uppercase i all look very close in a number of fonts: 1lI. Those would likely be put into a character class.


Hmmm, as a programming exercise I'd probably do a sum formula which gives weights based on how closely one letter resembles another. So n and m are very close, and L and I are both thrown into the same class, etc... E & 3? Not sure, I think they are different enough.

Actually, mapping every character to the lowest value in its class and using that internally for names seems like it would fit best... So you'd have a display name, and then the actual ID name that is used for checking for duplicates or look-alikes.
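Something like this, as a minimal sketch. The class list is a tiny hypothetical sample (the "0O" class is my own addition), and `LOOKALIKE_CLASSES`, `FOLD` and `id_name` are just illustrative names. "Lowest value" is taken to mean the lowest code point in the class.

```python
# Hypothetical look-alike classes; a real list would be much longer.
LOOKALIKE_CLASSES = ["1lI", "0O"]

# Map every member of a class to the member with the lowest code point,
# e.g. "l" and "I" both fold to "1".
FOLD = {ch: min(cls) for cls in LOOKALIKE_CLASSES for ch in cls}


def id_name(display_name: str) -> str:
    """Internal ID name used only for duplicate / look-alike checks."""
    return "".join(FOLD.get(ch, ch) for ch in display_name)
```

The display name is stored and shown unchanged; only the result of `id_name` is compared when checking whether a name is already taken.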
Starmaker: That's just a regular bug that has nothing to do with unicode. They didn't have to demand X=canonical(X), they had to check existence of canonical(X) when registering a user, aka not be fucking idiots.
onarliog: You are entirely ignoring the problem of homoglyphs there.
Er what? The problem in the linked article was that one could register a unique unicode username, have that converted to a less rich (canonical) form and grab an existing account which had the same canonical form. This is a problem regardless of what the canonical function is and what the alphabet is.

For example, the software I'm using right now allows English letters (uppercase and lowercase) and numbers for display names, then makes everything lowercase and inserts underscores if it detects camelcase. So, e.g., StarMaker and star_maker are displayed differently but would have had identical canonical representation, and this is why, if I registered one and tried to register the other, it would not have let me: it'd have seen that a record with the same canonical form already exists and thrown a (duh) "username already exists" alert.
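A minimal sketch of that check, assuming the canonical function is "insert an underscore at each lowercase-to-uppercase boundary, then lowercase". The names `canonical` and `UserRegistry` are mine, not the actual software's.

```python
import re


def canonical(name: str) -> str:
    """Insert "_" before an uppercase letter that follows a lowercase
    letter or digit, then lowercase everything."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


class UserRegistry:
    def __init__(self) -> None:
        # canonical form -> display name of the account that owns it
        self._by_canonical: dict[str, str] = {}

    def register(self, display_name: str) -> str:
        key = canonical(display_name)
        # The important part: look up canonical(X), not X itself.
        if key in self._by_canonical:
            raise ValueError("username already exists")
        self._by_canonical[key] = display_name
        return key
```

Registering "StarMaker" stores the key "star_maker", so a later attempt to register "star_maker" is rejected, whatever the canonical function happens to be.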

Also, the default font displays lowercase L and uppercase i the same way, and it doesn't matter because the problem in the linked article -- not the one dtgreene posed -- has nothing to do with human confusion / error / deception, nothing to do with visuals, indeed, and everything to do with shitty code.
What you're trying to do is pretty much a bad-word filter with a dynamic word list. They come with a whole range of problems, are high maintenance and usually don't work very well. Experience shows that users pretty much always find a way to circumvent them. If it was an easy problem, you'd expect internet giants like Google to have found a working solution already, but that's not the case.

I think the best way to combat impersonation is to make it easy to detect and report impersonators, with reports reviewed by humans. After all, coolguy123 and coolguy456 might have been using these names for a long while with no malicious intentions. I've seen cases like this one in the wild.