Archive for the 'Tech' Category

Python’s negative list indexing is confusing and shouldn’t be taught to novice programmers

Saturday, March 8th, 2014

I’ve been programming for a long time, in a wide variety of languages, but I never saw negative list indexing until I learned Python nine years ago to play around with making automated edits on Wikipedia. For those who don’t know, negative list indexing allows you to index into a list from the end instead of the beginning. Whereas arr[0] returns the first item in a list, arr[-1] returns the last item, arr[-2] the penultimate item, etc. It seemed like such a no-brainer; why haven’t all languages implemented this feature? It’s useful, right? That’s what I thought until I recently interviewed a spate of internship candidates for the upcoming summer, most of whom are learning programming in Python, and I saw firsthand the confusion that negative list indexing causes in novice developers.
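A quick illustration for anyone who hasn't seen it:

```python
# Negative indexes count backwards from the end of the list; -1 is the last
# element, -2 the one before it, and so on.
arr = ["a", "b", "c", "d"]
assert arr[0] == "a"    # first item
assert arr[-1] == "d"   # last item
assert arr[-2] == "c"   # penultimate item
```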

I do a fair amount of interviewing at my job, with an average of three or so interviews per week. We’re more often hiring developers than not. Now that I’m leading a development team, I have a huge vested interest in finding the best possible talent because inferior developers directly affect the quality of my team’s output, thus making me look bad. The main thing I focus on during interviews is programming problems. You’d be amazed at how many people do just fine in the talking portion but then fail miserably when asked to actually code something. So I use a progression of increasingly difficult programming problems, the first of which is usually “Write a function that takes a sentence and a number as input, and returns the re-arranged sentence with the specified number of words moved from the back to the front”. So for the example sentence “That is a dark brown fox” and the number 4, the output would be “a dark brown fox That is”. I ask the interviewees to solve the problem in whatever language they’re most comfortable with, because I’d rather get good developers in general than someone with certain specific language skills.

A typical solution in C#, which I get a plurality of responses in, looks like this:

public string Rearrange(string sentence, int numWords)
{
    string[] words = sentence.Split(' ');
    string answer = "";
    for (int i = words.Length - numWords; i < words.Length; i++)
    {
        answer += words[i] + " ";
    }
    for (int j = 0; j < words.Length - numWords; j++)
    {
        answer += words[j] + " ";
    }
    return answer.Trim();
}

It's not an amazing solution; there's a faster, more efficient way that doesn't require splitting the input into a words array at all (I'll leave that as an exercise for the reader). But it gets the job done, and most people end up thinking about the problem in this way. Yet a curious thing happened when I started interviewing current undergraduate students for the intern positions: most of them used Python, and all of the Python solutions used negative list indexing and went off the rails because of it. I understand why they immediately thought to use negative indexing; we are asking them to grab things off the end of a sentence, after all. But it consistently sowed confusion and never provided clarity.

There were two main problems that I saw when people used negative list indexing. One problem was that the interviewees would start with -1 and then decrement to -2, then -3, etc. This seems logical, but you end up reversing the order of the words in the rear segment, such that your output sentence from the above example becomes "fox brown dark a That is". What you really need to do is count "down" through the words in reverse, in the order -4, -3, -2, -1. But that's a bit harder to wrap your mind around than avoiding negative indexes entirely and simply counting up through the words in the order 2, 3, 4, 5. In particular I saw a lot of reversed parameters to the range function, which ends up executing the inside of the for-loop zero times. The problem looks like it's dealing with the end of the list, but it's really not; it's dealing with the middle of the list and counting up through to the end of it, for which using positive indexes is a lot more handy. But as soon as you say "end" to a novice programmer who's learned the "trick" of negative list indexes, that's where their mind immediately goes, and they tend to get stuck.
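The reversed-parameters mistake is easy to demonstrate: range (and xrange) produces nothing when the start is not less than the stop, so the loop body executes zero times.

```python
numWords = 4
# Reversed parameters: range counts upward by default, so this sequence is
# empty and a for-loop over it would run zero times.
assert list(range(-1, -numWords)) == []
# Correct: count "down" through the rear words in the order -4, -3, -2, -1.
assert list(range(-numWords, 0)) == [-4, -3, -2, -1]
```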

The second main problem with negative list indexing is that it doesn't mesh well with positive list indexing, both of which you end up needing to use to solve this problem. For instance, [0] is the first element in a list, and [1] the second, but [-1] isn't the second-to-last element in the list; it's the last one. This caused a lot of confusion. Positive list indexes are zero-indexed but negative list indexes are one-indexed. Fencepost errors abounded, or what you might even call reverse-fencepost errors, where interviewees expressed the index of the start of the second segment as either [-numWords +1] or [-numWords +-1] (both wrong but in different ways) when it should simply be [-numWords]. The interviewees simply didn't have enough experience using the two together to understand how they worked in unison, whereas when I prompted them to write the boundary conditions of both for-loops using positive indexing, the mental confusion usually went away and they were able to come up with a workable solution.
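The asymmetry is easy to state concretely:

```python
words = ["That", "is", "a", "dark", "brown", "fox"]
numWords = 4
assert words[0] == "That"   # positive indexes are zero-based
assert words[-1] == "fox"   # negative indexes are one-based
assert words[-1] == words[len(words) - 1]
# The start of the rear segment is simply [-numWords]: no +1, no -1.
assert words[-numWords] == "a"
```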

For the record, here is a correct solution in Python that uses negative list indexing. It doesn't look that tricky when you see it presented as an answer, but trust me when I say that no one was coming up with it on their own without access to a Python interpreter. The boundary condition of 0 on the first range in particular was highly unintuitive to all of the interviewees.

def rearrange(sentence, numWords):
    words = sentence.split()
    answer = ""
    for i in xrange(-numWords, 0):
        answer += words[i] + " "
    for i in xrange(0, len(words) - numWords):
        answer += words[i] + " "
    return answer.strip()
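Incidentally, Python's slice syntax sidesteps the index arithmetic entirely; here is a sketch of that approach (my own, not one any interviewee produced):

```python
def rearrange(sentence, numWords):
    # Assumes 0 < numWords <= len(words): numWords == 0 would make
    # words[-0:] the whole list, duplicating the sentence.
    words = sentence.split()
    return " ".join(words[-numWords:] + words[:-numWords])

assert rearrange("That is a dark brown fox", 4) == "a dark brown fox That is"
```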

And there's one final point that I can't help but bring up: reverse traversal of lists is cache inefficient. Granted it doesn't matter for this problem, and it probably won't matter for most things you do in Python, but if you're trying to write highly performant code in C/C++, like for scientific computation or a game engine, you're going to cause a lot more cache misses by trawling through memory in reverse order.

And the spammers have won (comments are now disabled)

Friday, December 13th, 2013

The problem with spammers on the web that I talked about before has continued getting worse and worse, to the point where hundreds of spam comments (and no identifiably non-spam comments) have made it through onto this blog. Akismet wasn’t doing the job, and I don’t maintain this blog often enough to keep all of the spam out. My only fix is to disable all comments. It sucks, because the two defining features that characterize a blog are (a) being personal and opinionated and (b) allowing multi-way communication between the author and the readers, and the readers with each other. I’m no longer doing half of that, so in some very real senses this is no longer a blog (though feel free to still email me if you want to chat). I’m really sad about it. But the spammers have won. It’s no longer feasible to try to maintain your own blog out there. Just host it with WordPress or similar and let someone else handle the spam problem, because I can’t stay on top of it any longer.

Using LINQ in C# to easily get a list of most often used words

Monday, May 20th, 2013

A pretty common programming interview question is to parse an input sentence and return the list of unique words used in that sentence. A further elaboration on that problem, the one that this post will be addressing, is to additionally calculate the number of occurrences of each word, and then return the top K words for some input value K. I’ll be demonstrating a simple solution to that problem in C#, both because I’ve been using it a lot recently and also because the choice of C# gives us access to LINQ, a powerful C# language feature that allows queries on collections using a SQL-like syntax. The top K problem is incredibly easy to solve in SQL, boiling down to SELECT TOP @K ..... FROM Words ORDER BY Word.Occurrences DESC. The C# solution is similarly easy.

First, I’ll clarify the assumptions that I’m using (and that would be wise for an interviewee to address if none of these are made explicit):

  • I’m writing my solution to be case-insensitive. “case”, “Case”, and “CASE” will all thus count as the same word.
  • I’m not dealing with ties. If you ask for the top 3 words by occurrence and there are 5 words that are all used the most in equal numbers, then you’re still only going to get three words of those five, selected in no particular manner.
  • I’m going to use C#’s regular-expression definition of a non-word character to separate the input sentence into words. The naive solution would be to split the string on spaces alone, but then you’re not handling punctuation correctly.

And for the solution:

Dictionary<string, int> GetTopKWords(string input, int k)
{
	// Split on runs of non-word characters; splitting on @"\W" alone
	// yields empty strings between consecutive separators.
	string[] words = Regex.Split(input, @"\W+");
	var occurrences = new Dictionary<string, int>();

	foreach (var word in words)
	{
		if (word.Length == 0)
			continue;
		string lowerWord = word.ToLowerInvariant();
		if (!occurrences.ContainsKey(lowerWord))
			occurrences.Add(lowerWord, 1);
		else
			occurrences[lowerWord]++;
	}
	return occurrences.OrderByDescending(kvp => kvp.Value)
		.Take(k)
		.ToDictionary(kvp => kvp.Key, kvp => kvp.Value);
}

The vast majority of this code is responsible simply for finding the set of unique words and the number of occurrences of each one. Once that is known, finding the top K words is a single line of code thanks to the power of LINQ. All we’re doing is ordering the words by frequency in descending order and taking the top K. Without LINQ, there would be significantly more book-keeping code required, which would make a good exercise for the reader (e.g. solve this problem in Java). The first roadblock you’ll probably run into is that you can’t simply flip the keys and values of the dictionary, because the frequency counts aren’t unique. The best I’ve come up with is to construct a list of tuples from the dictionary of words, order it on the occurrence-count part of each tuple, and then extract the first K elements from the resulting ordered list and return them. Let me know in the comments if you have a better solution.
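For comparison, here is a sketch of the same top-K computation in Python (my example, not part of the original post), where collections.Counter fills the role that LINQ plays in the C# version:

```python
import re
from collections import Counter

def get_top_k_words(text, k):
    # Lowercase, split on runs of non-word characters, and count occurrences.
    words = re.split(r"\W+", text.lower())
    counts = Counter(w for w in words if w)
    # most_common(k) orders by frequency, descending; ties are broken
    # arbitrarily, matching the "not dealing with ties" assumption above.
    return counts.most_common(k)

assert get_top_k_words("the cat and the dog and the bird", 2) == [("the", 3), ("and", 2)]
```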

Oh, and here’s an example input/output for the program, handling the display of output using LINQPad:

var input = "the quick brown fox is brown and jumps over the brown log over the long fire and quickly jumps to a brown fire fox";
GetTopKWords(input, 10);

outputs:

Top K Words sample run

Fixing an error with being unable to add user fields in Drupal 7

Wednesday, April 4th, 2012

Drupal 7 has a nice built-in Fields functionality which can be used to add fields to any entity. As applied to users, this replaces the previous Profile module which was used in Drupal 6 to add fields to users. However, right after creating a new Drupal 7 site, I couldn’t figure out how to add fields to users.

I followed these instructions:

Just go to:

Configuration > People > Account Settings

Then click on the Manage Fields tab, and then just manage the fields just like you would for a Content Type.

But there wasn’t a “Manage Fields” tab on that page. Going to the URL directly, admin/config/people/accounts/fields, was redirecting me back to the Account Settings page. After banging my head against the wall for ten minutes I finally realized that the reason I couldn’t see this tab was because the “Field UI” module wasn’t enabled. Go to the List Modules page, enable that module (and its dependencies), and now you should be able to add fields to users.

Reminiscing about the naïve, spam-free days of the web

Tuesday, July 21st, 2009

Remember a long time ago when the web was free of spam? I’m not talking about email, which has had spam problems for awhile, I’m talking about the web. Nowadays, the web is festering with link-crawling spambots. Anyone with a blog, Twitter account, or heck, even a webpage with a simple submit form with some text fields on it, knows this. There’s not much that can be done about it besides spam-detection heuristic algorithms and CAPTCHAs.

Well, I just recently found some code that I wrote way back in 2002 that displays a blissfully unaware naïveté of what was to come. That code was part of my website Fyre’s Domain, which I have since put an archived copy of online. I had just been learning Perl CGI and I wanted to write a simple guestbook/comments form that readers could use to give me feedback without having to use a mailto: link. This was in the era before blogging software was commonplace — what I was running was a home-brew blog, but before the word “blog” was even invented. I basically copied the format from one of the first chat rooms I ever used, Paddynet, way back in 1995 or so. The “chat room” consisted of an HTML form that would dump unvalidated input (including HTML tags) into a chat buffer displayed on the page, which showed the last 30 or so messages.

Paddynet was around long before spambots, but my site was started right when they began appearing in the wild, and the code proceeded to run for another 7 years until I just shut it off.

You can probably guess what happened.

The only reason I even re-discovered this code is because I happened to notice it was getting an unusual number of hits in my web analytics software. And those hits were anything but benign. My poor naïve Perl CGI comments submission form has accumulated 26 MB worth of text over the years, all of it spam. And since I figure it may be interesting to someone to see exactly what seven years of web spam looks like, you can download it for yourself (a text file thankfully compressed down to just 1.8 MB). If anyone finds any interesting trends in popular spam topics over the years in there, do let me know.

So those are the dangers of trusting user input on the web these days. Revel in the blissful simplicity of the following code, which was all it took to implement a comment submission system back in the day. Nowadays you couldn’t get away with anything even close to it. As my data proves, you’ll be eaten alive by spambots.

#!/usr/bin/perl

use CGI qw(:standard);

print header;
print start_html('Leave Comments on Fyre'),
	h1('Leave Comments on Fyre'),
	start_form,
	"<i>Note, all fields are optional, but empty comments will be ignored.</i><br>", 
	"Name: ", textfield(-name=>'name',-default=>''),
	"E-mail: ", textfield(-name=>'e-mail',-default=>''),
	"Your Comments: <br>", textarea(-name=>'Comments',-rows=>10,-columns=>50,-default=>''),
	'<br>',
	submit('Submit'), reset,
	end_form,
	p,
	hr;

if (param() && param('name') ne '' && param('Comments') ne '') {
	$date = `date '+%H:%M %m/%d/%Y'`;

	print '<i>Your comment has been posted.</i><hr><br>';
	@foo = "\n\n" . '<br><b>' . param('name') . '</b> ' . "\n" .
	'<u>' . param('e-mail') . '</u> ' . "\n" . '<i>' . $date . '</i>' . 
	"\n" . '<table><tr><td width = "100%">' . param('Comments') . 
	'</td></tr></table><hr>';
	push @foo, `cat mk.txt`;
	open CFILE, ">mk.txt" or die "Failed to open comments file!";
	print CFILE @foo;
	close CFILE;
}

@foo = `cat mk.txt`; print @foo;

print 'This program is open source, and released under the GPL by Ben McIlwain, 2002.  See 
the source <a href = "mk_script.txt">here</a>.';
print end_html;

My once-tiny GNU/Linux desktop morphs beyond all recognition

Tuesday, April 14th, 2009

Enermax Chakra
Almost a year ago, I bought a cute little desktop from Dell with the intent of using it as a GNU/Linux desktop alongside my existing Windows desktop. Its name is Vertumnus. But things don’t always turn out as planned. I quickly started using Vertumnus as my exclusive desktop PC, booting the Windows machine only to play games. Eventually I reformatted the Windows computer and the only applications I’ve reinstalled have been games, so it’s pretty much reduced to a gaming appliance at this point, like an XBOX360 but better.

The only problem is that when I originally bought Vertumnus, I didn’t have all of this in mind, and so I bought it rather under spec. I would’ve been better off just buying a better computer from the get-go. As a result, I’ve had to do quite a few upgrades over the past year to get it to meet my needs. From the very beginning I added more RAM and another hard drive. Then it joined a Stand Alone Complex. Then I added another hard drive. From the outside it still looked the same, but a lot of the interior was upgraded. Now even that is no longer true.

Yesterday, I spent two hours (and another $160) redoing the computer even further. The case was too cramped and was preventing further upgrades. So I moved the computer into a new case, the Enermax Chakra. It’s appreciably bigger than the previous Dell case. It’s also a lot more flexible on the inside in terms of which parts will fit into it. Why the Chakra? I only had two criteria, but the Chakra was pretty much the only case that met both of them: 1) It had to have a 250mm fan, but 2) No LEDs. Both criteria come from my computer living in my bedroom: it has to be silent (hence a big, slow-spinning fan) and it has to be dark, so that I can sleep!

Since the case didn’t come with any fans besides the huge 250mm one, I purchased two of the quietest 120mm fans in existence, the Scythe Gentle Typhoon. Again, my criteria were the same: Quiet and no LEDs. The Gentle Typhoons best met those. I also had to get a new power supply, because the 250 Watt one from Dell isn’t able to accommodate the video card I was about to put in. So I went with the Corsair 550W PSU. It was the power supply that best met my criteria: High efficiency (85%!), quiet (a big 120mm fan), and no LEDs. And it’s more than enough to power the video card that I put in, a hand-me-down GeForce 8800 GTS. Yes, that’s right, I finally got tired of the inferior performance of the Intel integrated graphics card. Now I can actually play modern 3D games in GNU/Linux.

And as if all that wasn’t enough, while transitioning all of the parts from one case to another, the CPU fan developed a faulty bearing which makes it obnoxiously loud. So the first thing I hear upon starting up my supposed-to-be-silent computer is a loud whirring fan noise. Rather than giving up my dreams of a silent computer, I ordered a replacement CPU fan/heatsink, the Arctic Cooling Freezer 7 Pro. Why that one? I already have one in my Windows computer and it cools really well. Plus it’s quiet. It hasn’t arrived yet, but it’s going into Vertumnus as soon as it does.

The new GeForce 8800 GTS is so large that it covers up one of the SATA ports on the Dell motherboard (and another one is rendered inaccessible to all but right-angle SATA connectors). Since I have three SATA hard drives and one SATA DVD-R drive, that’s a problem. The DVD drive is currently unplugged, but I’ll swap it out for an IDE DVD-R drive from my Windows desktop soon — thankfully, the video card doesn’t block the IDE port.

Once all of this is done, the only parts remaining in Vertumnus from the original purchase will be the Intel Core 2 Duo E7200 processor, two 1 GB sticks of DDR2 RAM, the motherboard, and one 500 GB hard drive. And that’s after less than one year. Clearly, I tried saving too much money by buying a system far below my ultimate desired specifications, then wasted a bit more than those savings on upgrades. And I can’t even say the upgrades are done. At some point I’m going to need another hard drive, but since I’m all out of SATA ports, I’ll either have to get an add-in card or replace the motherboard. The original RAM that Dell shipped was pretty slow, and can easily (and cheaply) be replaced with something better. And the processor is looking slightly anemic. A nice quad-core processor would be fun to play around with …

Long story short, in another year, it’s quite possible that the only component remaining from my original purchase will be the 500 GB hard drive and a SATA cable or two. I guess I learned my lesson. Don’t try to save too much money on a computer if, at heart, you’re really just a techie who demands performance.

Why I use Identi.ca and you should too

Sunday, March 22nd, 2009

Those of you following me on Twitter may have noticed that all of my tweets come from Identica. I started off with Twitter but I quickly switched over to Identica as soon as I learned about it. Identica, if you haven’t heard of it before, uses the same micro-blogging concept as Twitter (and in fact is compatible with it), but has several improvements. I recommend Identica, and if you aren’t using it yet, check out these reasons as to why you should.

There are several practical reasons you should use Identica:

  • All of your data is exportable on Identica, including your entire corpus of tweets. Twitter does not provide this functionality, so should you want to migrate away from Twitter down the road (for any variety of as-yet-unforeseen reasons), you would be unable to do so; you can migrate away from Identica at any point, easily. And since Identica runs on the Free Software Laconica software, you can even install Laconica on your own web host and import all of your data there, where you have complete control over it.
  • Identica has a powerful groups feature that allows people to collectively subscribe and see all tweets sent to a group (this is what the exclamation syntax you may have seen in tweets is about). Groups are a powerful way to build communities and have multi-party discussions, but Twitter does not have them.
  • You don’t have to quit Twitter. My Identica account is linked to my Twitter account, so every message that I send to Identica automatically appears on Twitter. Posting to Identica+Twitter takes the same amount of effort as posting to Twitter alone, except it is seen by more people.
  • Identica lets you see things from other people’s perspective. I’ll use myself as an example. You can see my entire tweet stream, which includes messages from all users and groups I’m following. This should give you a great idea of the kinds of things I’m interested in. And you can see all of the replies to me, which makes it a lot easier to track and understand conversations. Note that all of this is public information that is accessible on Twitter only through trickier means (in the first case, by looking at the list of people someone follows and combining all of their tweets in chronological order; in the second case, by searching for “@username” on the search subdomain), so you aren’t giving up any privacy. Identica simply makes these features a lot easier to use.
  • Some people you may end up finding and wanting to talk with don’t use Twitter at all; they’re only on Identica. Get on Identica and link it to Twitter and you can talk to everyone on both services. Just use Twitter, however, and you’re left out in the cold with regards to anyone who only uses Identica.

And there is one important ethical reason you should use Identica:

  • Identica is Free (as in freedom, not merely cost). Because it follows the Free software ethos, it respects your rights and maximizes your freedom to control your data as you see fit, including the ability to move all of your data elsewhere if necessary. Twitter does not respect these freedoms.

Australia blocks my page from their Internet

Wednesday, March 18th, 2009

A couple years ago, when I was more active on Wikipedia than I am now, I was trying to prove a point by compiling a list of all of the risque images on Wikipedia (link obviously NSFW). I don’t quite remember what that point was anymore, but the list remains. It has even survived a deletion attempt or two. I stopped maintaining it a long time ago, but for whatever reason, others picked it up and continued adding more pictures in my stead. I hadn’t thought of it in awhile.

So imagine my surprise when I learn that that silly page has made Australia’s secret national Internet censorship blacklist. I don’t understand the justification here — all of these images are hosted on Wikimedia servers, after all — but I have to laugh when I imagine some Australian apparatchik opening a report on this page, viewing it, making the determination that it’s not safe for Australian eyes, and adding it to the list without further thought, mate.

Australians, please take back control of your country.

A Python script to auto-follow all Twitter followers

Tuesday, March 10th, 2009

In my recent fiddling around with Twitter I came across the Twitter API, which is surprisingly feature-complete. Since programming is one of my hobbies (as well as my occupation), I inevitably started fooling around with it and have already come up with something useful. I’m posting it here, so if you need to do the same thing that I am, you won’t have to reinvent the wheel.

One common thing that people do on Twitter is they follow everyone that follows them. This is good for social networking (or just bald self-promotion), as inbound links to your Twitter page show in the followers list of everyone that you’re following. You’d think Twitter itself would have a way to do this, but alas, it does not. So what I wanted to do is use a program to automatically follow everyone following me instead of having to manually follow each person.

Other sites that interface with Twitter will do it for you (such as TweetLater), but I’m not interested in signing up for another service, and I’m especially not interested in giving out my Twitter login credentials to anyone else. So I needed software that ran locally. A Google search turned up an auto-follow script written in Perl, but the download link requires registration with yet another site. I didn’t want to do that so I decided to program it for myself, which ended up being surprisingly simple.

My Auto-Follow script is written in Python. I decided to use Python because of the excellent Python Twitter library. It provides an all-Python interface to the Twitter API. You’ll need to download and install Python-Twitter (and its dependency, python-simplejson, if you don’t have it already; sudo apt-get install python-simplejson does the trick on Ubuntu GNU/Linux). Just follow the instructions on the Python-Twitter page; it’s really simple.

Now, create a new Python script named auto_follow.py and copy the following code into it:

#!/usr/bin/python
# -*- coding: utf-8 -*-
#(c) 2009 Ben McIlwain, released under the terms of the GNU GPL v3.
import twitter

username = 'your_username'
password = 'your_password'
api = twitter.Api(username=username, password=password)

following = api.GetFriends()
friendNames = set()
for friend in following:
    friendNames.add(friend.screen_name)

followers = api.GetFollowers()
for follower in followers:
    if follower.screen_name not in friendNames:
        api.CreateFriendship(follower.screen_name)

Yes, it really is that simple. I’d comment it, but what’s the point? I can summarize its operation in one sentence: It gets all of your friends and all of your followers, and then finds every follower that isn’t a friend and makes them a friend. Just make sure to edit the script to give it your actual username and password so that it can sign in.

Run the script and you will now be following all of your followers. Pretty simple, right? But you probably don’t want to have to keep running this program manually. Also, I’ve heard rumors that the Twitter API limits you to following 70 users per hour (as an anti-spam measure, I’m guessing), so if you have more than 70 followers you’re not following, you won’t be able to do it all at once. Luckily, there’s a solution for both problems: add the script as an hourly cronjob. This will keep who you follow synced with your followers over time, and if you have a large deficit in who you follow at the start (lucky bastard), it’ll slowly chip away at it each hour until they do get in sync. In Ubuntu GNU/Linux, adding the following line to a text file in /etc/cron.d/ (as root) should do it:

0 * * * * username python /path/to/auto_follow.py >/dev/null 2>&1

This will run the auto_follow script at the top of each hour. You’ll need to set the username to the user account you want the job to run under — your own user account is fine — and set the path to wherever you saved the auto_follow script. Depending on your GNU/Linux distribution and which cron scheduler you have installed, you may not need the username field, and this line might go in a different file (such as /etc/crontab). Refer to your distro’s documentation for more information.

So that’s it. That’s all it takes to automatically auto-follow everyone who’s following you — a dozen or so lines of Python, one crontab entry, and one excellent library and API. Enjoy.

Here’s a pretty bad Unicode WTF

Tuesday, March 3rd, 2009

I’m doing some research on Unicode and compression algorithms right now for a side-project I’m working on, and I came across a highly ranked Google search result for a UTF-8 munging code snippet that is so idiotic I couldn’t let it pass without comment. If this post helps even one person who would’ve otherwise followed the linked advice, it is worth it.

First, some background. UTF-8 is a character encoding format that can pretty much handle any character under the Sun, from the English alphabet to Japanese kanji to obscure extinct languages. It even includes thousands of esoteric symbols used in smaller fields of study that you’ve probably never even heard of before. But the nice thing about UTF-8 is that it is variable-length. Standard ASCII characters (including everything on a standard English keyboard) only take one byte to represent. All of the common characters from other widely used languages typically take just two bytes to encode. It’s only the really obscure characters that require more than two bytes.
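The variable-length property is easy to demonstrate in Python:

```python
# UTF-8 encodes code points in one to four bytes: one for ASCII, two for
# common accented characters, three for CJK, four for the rarest code
# points (e.g. emoji).
assert len("A".encode("utf-8")) == 1
assert len("é".encode("utf-8")) == 2
assert len("漢".encode("utf-8")) == 3
assert len("😀".encode("utf-8")) == 4
```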

So now you see why the linked “solution” is so stupid. This guy says he is “designing a little client/server binary message format” and wants “a simple and quick way to encode strings”. Well, duh — use UTF-8, no ifs, ands, or buts about it. It’s simple, quick, and already implemented in any programming language you can think of, so it requires no additional coding. There are all sorts of dumb ways to unnecessarily reinvent the wheel in software engineering, but trying to come up with your own character encoding is particularly idiotic. It’s really tricky to get right because there are so many corner cases you’ll never even know existed until they cause your application to break. The Unicode Consortium exists for a reason — what they do is hard.

This guy even confesses that his expected input will probably not contain Unicode characters that are longer than 2 bytes. So there is no justification at all for what he does next — he creates a mangled version of UTF-8 that turns all Unicode characters 3 bytes and longer into question marks, instead of just leaving them as is. So instead of allowing a rare character to take an additional byte or two, it gets mangled. And to accomplish this, he has to create his own custom encoding solution that is an intentionally broken version of UTF-8. That’s the worst part — he’s wasting time creating completely unnecessary code, that will need to be maintained, that will need to be debugged — and for what?

Of course, none of the people responding to his thread point out that what he is trying to do is stupid. They just smile and hand him some rope.