Simple performance tuning of a Lua library

The pure-lua png reader I’ve found on the Internet worked great - but only for the icon-sized images. It could take minutes (plural!) to load 100 square pixels. That forced me to roll up my sleeves and look into the code…

The pngLua Library

The story began with the search for an image reading library written in Lua for one of my teaching projects. The choice was surprisingly limited; there were a couple of all-purpose monsters that, in addition to reading any image format on Earth, can rotate the image, extract feature vectors, fly to the moon, do some household chores. But I needed something simple and there was one: pngLua. It had all the features I wanted:

  • lightweight, pure Lua,
  • it could load a .png file,
  • very straightforward API (image size, RGB of any pixel at (x, y)).

And it worked quite well when I only used it on the 32x32 images. Suddenly, when I switched to 500x500, everything just froze. It took me a while to even narrow the problem down to image loading, because the whole program was quite complex and the image loading was the last thing I was willing to blame for the freeze.

Still, after some tracing and bisection, I had to believe the evidence: loading of a mere 500x500 png image got stuck. A few ^C attempts revealed a suspicious pattern:

^Clua5.1: ./stream.lua:82: interrupted!
stack traceback:
	[C]: in function 'concat'
	./stream.lua:82: in function 'writeChar'
	./stream.lua:95: in function 'writeByte'
	./png.lua:203: in function '__init'
	./30log.lua:13: in function <./30log.lua:10>
	(tail call): ?
	./png.lua:269: in function '__init'
	./30log.lua:13: in function <./30log.lua:10>
	(tail call): ?
	test.lua:7: in main chunk
	[C]: ?

The PNG Format

So the library was writing something. At first, I thought that’s the problem: why would an image reader want to write? However, after inspecting the code and the PNG file format, I found that writing was necessary in most common cases.

The PNG format employs lossless compression; to get better compression rates, it defines four filters (five, if you count “do nothing” as a filter) that “prepare” the byte stream for the compression algorithm in one way or another.

The most commonly used filter, Paeth, replaces each byte of the stream with the difference between said byte and some prediction function that operates on three “nearby” bytes. So to get the original value, one needs to reverse the encoding by replacing each byte with the result of the decoding function.

Slowness

After the initial investigation, it was easy to pinpoint the problem. The stream was implemented as a string and, since strings in Lua are immutable, changing one character in that string did cost a lot. Here’s the culprit:

...
Stream.data = ""
...
function Stream:writeChar(char)
    ...
    self.data = table.concat{self.data:sub(1,self.position-1), 
			     char, 
			     self.data:sub(self.position+1)}
    ...
end

For every byte in a 500x500 pixels image (approx. 1,000,000 bytes), the above line of code allocates and writes to at least 2Mb of memory! And that is multiplied by one million. No wonder it took eternity to finish.

I couldn’t wait until a 500x500 had been loaded in full, but I succeeded in measuring performance of the library on a smaller 283x150px image:

$ time lua5.1 test.lua
real	4m59,321s
user	5m4,112s
sys	0m0,592s
	Maximum resident set size (kbytes): 370840

It took 5 minutes and 360 MB to load a mere 0.4 megapixel image. Not too good.

The Fix

I wasn’t (and am not) very proficient in Lua; a quick googling around had revealed no easy way to get modifiable strings. However, why string at all?..

It had been easy enough to change the internal representation to a (modifiable!) table of characters (which are actually strings in Lua). The code from above became much simpler:

...
Stream.data = {}
...
function Stream:writeChar(char)
    ...
    self.data[self.position] = char
    ...
end

and the performance improvement was dramatic: ~250x speedup (one second instead of several minutes)!

$ time lua5.1 test.lua
real	0m1,200s
user	0m1,184s
sys	0m0,012s
	Maximum resident set size (kbytes): 61844

The Further Improvement

Since I already had spent enough time on that, I decided to see if it’s possible to squeeze some more out of the code. The obvious thing was to drop strings entirely - since we are working with bytes anyway. Instead of having a table of strings, have a table of numbers.

That additional change resulted in a 1.2x speedup (x300 compared to the original) and 3MB savings in memory consumption:

$ time lua5.1 test.lua
real	0m1,010s
user	0m0,980s
sys	0m0,020s
	Maximum resident set size (kbytes): 58432

That’s where I decided to stop; there were now low hanging fruits in sight anymore and it had already become good enough for my purpose. So I moved on to doing the thing I needed library for in the first place.

After a while, this work had been integrated into the library so that it’s useful to more people; here’s the patch, if you’re interested.

References

Maxim Kartashev

Maxim Kartashev
Pragmatic, software engineer. Working for Altium Tasking on compilers and tools for embedded systems.