Peterson Decoder

The summer after my sophomore year in college, I did an internship at OSRAM OptoSemiconductors GmbH, a leading manufacturer of LEDs.

Problem The company had the following problem in the manufacturing process: thousands of wafers go through the process each day, and they are tracked by manually writing down wafer numbers. However, at various stages of the process, high-quality images are taken of each wafer for quality control of the chips being made. The goal of my project was to apply image processing techniques to perform optical character recognition (OCR) on the numbers and letters that uniquely identify each wafer, so that wafers could be tracked automatically through the production process.

Challenges of the OCR approach The images were taken through a microscope, which produced very low-contrast images in which the difference between foreground and background was within 10 gray-scale values. This property of the images made it extremely difficult for a program to distinguish the unique identifier of a wafer from the background. Furthermore, the images were taken at different brightness levels, and the location of the unique identifier was not consistent from wafer to wafer, which made the problem even more challenging.

Attempted Solutions I implemented various image processing algorithms to perform OCR:

  • cross-correlation -- sample images of numbers and letters are compared against the characters in the test set of wafer images,
  • differentiating foreground from background by calculating the gradient of the gray-scale values,
  • edge detection, and others.

However, they produced inconsistent results due to the problems with the images mentioned above. I also used machine learning, training the system on 200 images of each letter and number; the training images were images of wafers in production. The system achieved a 75-90% success rate on the test set, which did not meet the production requirements, one of the main ones being that the solution must be robust and reliable.

The Solution Once I discovered that none of the standard OCR techniques would produce satisfactory results, I realized that a different approach should be taken. Luckily for me, there was another way. When a wafer is labeled with a unique identifier, a data matrix that encodes the identifier is engraved onto the wafer (example). A Data Matrix is a two-dimensional square or rectangular matrix barcode that consists of black and white dots representing binary 0s and 1s. ECC200, the standard I worked with, uses Reed-Solomon codes, which contain redundant symbols in addition to the symbols of the label. They can detect and correct multiple random errors, up to ⌊r/2⌋ symbols, where r is the number of redundant symbols added to the code.

I noticed that although the location of the label varied from wafer to wafer, all wafers had one feature in common: the label was located on the flat edge of the wafer. So this was the solution to the problem:

  1. find the flat region of the wafer and locate the data matrix;
  2. decode the ECC200 data matrix.

We will focus on the most interesting part of this project: how to decode the data matrix. During my internship I had a book, written in German, that described the general idea behind ECC200 data matrices and specified how they are encoded and mapped for some of the common sizes. However, the particular size of data matrix used by OSRAM OS was unusual and was not described in the book. I reverse-engineered the mapping of the symbols onto the data matrix and its encoding, based on the information gathered from the book and the official standard specification. The standard specifies which irreducible polynomial defines the Galois field used to encode the symbols in the data matrix, depending on the number of redundant symbols in the matrix. And now that we know what field we are in, we know where to go.
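To make the field concrete, here is a minimal Python sketch of GF(2^8) arithmetic built from a reducing polynomial. The polynomial 0x12D (x^8 + x^5 + x^3 + x^2 + 1) is the one commonly cited for ECC200; treat it, and the table layout, as assumptions for illustration rather than the exact values from the specification I worked with.

    # Minimal GF(2^8) arithmetic derived from a reducing polynomial.
    # 0x12D is commonly cited for ECC200; it is an assumption here.
    PRIM = 0x12D  # x^8 + x^5 + x^3 + x^2 + 1

    EXP = [0] * 510   # powers of alpha, doubled to avoid modular reduction
    LOG = [0] * 256   # discrete log of each nonzero field element
    x = 1
    for i in range(255):
        EXP[i] = EXP[i + 255] = x
        LOG[x] = i
        x <<= 1              # multiply by alpha
        if x & 0x100:        # degree reached 8: reduce by the field polynomial
            x ^= PRIM

    def gf_mul(a, b):
        # Multiplication via the log tables; addition in GF(2^8) is plain XOR.
        if a == 0 or b == 0:
            return 0
        return EXP[LOG[a] + LOG[b]]

    assert gf_mul(EXP[3], EXP[7]) == EXP[10]   # alpha^3 * alpha^7 == alpha^10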

Each symbol in the label maps to an 8-bit binary representation (usually its ASCII code plus a magic number). In addition, the data matrix contains redundant symbols that enable error correction, so that the symbols of the label can be recovered if a part of the wafer's label is chipped off during production. To further improve recovery in the case of mechanical damage to the edges of wafers, the binary representation is mapped onto the data matrix in an L-shaped configuration, and some of the data and redundant symbols are distributed across the edges of the matrix to protect it from losing too much information. See the diagram below for more information.
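As a tiny, hypothetical illustration of that first mapping step (the exact offset depends on the encodation scheme; the plus-one rule below follows the common ASCII encodation and is an assumption here):

    # Hypothetical illustration: under ECC200's ASCII encodation, a plain
    # character is commonly stored as its ASCII value plus one.
    def to_codeword(ch):
        return ord(ch) + 1

    print(to_codeword("A"))   # 'A' (ASCII 65) becomes codeword 66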

The data matrix consists of a frame along the borders and the data area in the middle. For larger data matrices like this one, the area is split into 2 segments. There are 10 symbols in the label, which are mapped onto the wafer. In addition, there are 11 redundant symbols distributed across the data matrix, resulting in a total of 21 symbols; with r = 11 redundant symbols, up to ⌊11/2⌋ = 5 corrupted symbols can be corrected.

Once the symbols are correctly located, the Peterson-Gorenstein-Zierler (PGZ) decoder is used to decode the label. I implemented the decoder in MATLAB following the theoretical description in the Wikipedia link. The steps of the algorithm are outlined below, followed by a simplified sketch:

  1. read the codeword and locate all the label symbols and redundant symbols;
  2. determine whether there are any errors in the codeword by calculating the syndromes;
  3. calculate the number of errors, if any were detected;
  4. calculate the locations of the errors;
  5. compute the error values, that is, find the "offset" of each corrupted symbol;
  6. correct the codeword and display it.
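The original MATLAB implementation is not reproduced here, but the following self-contained Python sketch walks through the same steps over GF(2^8). The field polynomial (0x12D), the syndrome range S_1..S_r, and the codeword layout (highest-degree coefficient first) are assumptions made for illustration, not the exact parameters of the matrices I decoded.

    # A sketch of the Peterson-Gorenstein-Zierler steps over GF(2^8). The
    # field polynomial, syndrome range, and codeword layout are assumptions
    # made for illustration; they are not taken from the ECC200 spec itself.
    PRIM = 0x12D
    EXP, LOG = [0] * 510, [0] * 256
    x = 1
    for i in range(255):
        EXP[i] = EXP[i + 255] = x
        LOG[x] = i
        x <<= 1
        if x & 0x100:
            x ^= PRIM

    def gf_mul(a, b):
        return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

    def gf_inv(a):
        return EXP[255 - LOG[a]]

    def poly_eval(poly, x):
        # Horner's rule; poly holds coefficients, highest degree first.
        y = 0
        for c in poly:
            y = gf_mul(y, x) ^ c
        return y

    def solve(matrix, rhs):
        # Gauss-Jordan elimination over GF(2^8); returns None if singular.
        n = len(matrix)
        a = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
        for col in range(n):
            pivot = next((r for r in range(col, n) if a[r][col]), None)
            if pivot is None:
                return None
            a[col], a[pivot] = a[pivot], a[col]
            inv = gf_inv(a[col][col])
            a[col] = [gf_mul(v, inv) for v in a[col]]
            for r in range(n):
                if r != col and a[r][col]:
                    f = a[r][col]
                    a[r] = [a[r][k] ^ gf_mul(f, a[col][k]) for k in range(n + 1)]
        return [a[i][n] for i in range(n)]

    def pgz_decode(codeword, r):
        n = len(codeword)
        # Step 2: syndromes S_j = c(alpha^j); all zero means no error detected.
        S = [poly_eval(codeword, EXP[j]) for j in range(1, r + 1)]
        if not any(S):
            return codeword
        # Step 3: find the number of errors nu by shrinking the syndrome matrix
        # until it is non-singular, then solve for the error locator coefficients.
        for nu in range(r // 2, 0, -1):
            A = [[S[i + j] for j in range(nu)] for i in range(nu)]
            lam = solve(A, [S[nu + i] for i in range(nu)])   # [L_nu, ..., L_1]
            if lam is not None:
                break
        else:
            raise ValueError("too many errors to correct")
        # Step 4: error locations are the degrees d for which alpha^-d is a root
        # of the locator polynomial Lambda(x) = 1 + L_1*x + ... + L_nu*x^nu.
        locator = lam + [1]                                  # highest degree first
        locs = [d for d in range(n) if poly_eval(locator, gf_inv(EXP[d])) == 0]
        if len(locs) != nu:
            raise ValueError("decoding failure")
        # Step 5: solve sum_i E_i * (alpha^d_i)^j = S_j for the error values E_i.
        B = [[EXP[(d * (j + 1)) % 255] for d in locs] for j in range(nu)]
        errors = solve(B, S[:nu])
        # Step 6: add the error values back into the received word (XOR) to fix it.
        fixed = codeword[:]
        for d, e in zip(locs, errors):
            fixed[n - 1 - d] ^= e
        return fixed

For the wafer labels, the received vector would be the 21 codewords read off the matrix (10 data plus 11 check symbols), and r would be 11.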

Decaf Compiler

Intermediate Representation

Optimizations

Sparse Conditional Constant Propagation

Coming soon...

Memory Allocator (Inspired By JEmalloc)

Memory management is a very interesting problem, which we have learned to solve well enough that most programmers these days don't have to worry about it, thanks to advanced garbage collectors. When designing a heap allocator, the following concerns should be addressed:

  • speed -- how many allocations and deallocations can be performed per unit of time,
  • space overhead -- space used by the allocator for bookkeeping,
  • internal fragmentation -- waste of memory due to allocating larger blocks than the user requested,
  • external fragmentation -- waste of memory due to the inability to use storage because it is non-contiguous, mitigated by relocating allocations to compact the heap and by coalescing free blocks,
  • ability to allocate memory in parallel for better throughput.

Parallel allocators pose additional challenges, such as:

  • blowup -- additional memory waste that a parallel allocator produces beyond what a serial allocator would require,
  • loss of scalability due to lock contention,
  • false sharing, when the allocator satisfies memory requests from different threads using the same cache line.

So, we want a memory allocator that keeps the memory compact, uses as little additional memory as possible for bookkeeping, allocates and deallocates quickly, and serves multiple requests at the same time, taking advantage of thread-local caches whenever possible.

Memory Allocators Let us take a look at three common memory allocation designs. The simplest kind to implement is a buddy memory allocator, which has little external fragmentation and fast compaction because coalescing freed blocks is easy. It can suffer from internal fragmentation, depending on the size of the lowest-order memory block and the granularity of the block sizes. Slab allocation is commonly used to reduce the effects of internal fragmentation by allowing reuse of fixed-size blocks. Since the majority of applications allocate and deallocate many objects of similar sizes, this strategy is very effective. The improved buddy allocator provides better time guarantees while keeping the distribution of fragmentation the same as the standard approach: worst-case constant time for allocation and worst-case/amortized constant time for deallocation, instead of O(lg n). The Linux kernel uses the buddy system with several modifications to further minimize external and internal fragmentation.
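To make the buddy system's cheap coalescing concrete, here is a toy Python sketch; the order range, block sizes, and free-list layout are invented for illustration and are not how any production allocator lays things out. The key trick is that the buddy of a block at a given offset is found by XOR-ing the offset with the block size.

    # Toy buddy-system bookkeeping: blocks are power-of-two sized, and the
    # buddy of a block at `offset` with size `size` lives at `offset ^ size`.
    # Orders, sizes, and the free-list layout are made up for illustration.
    MIN_ORDER, MAX_ORDER = 5, 12          # 32-byte up to 4 KiB blocks
    free_lists = {o: set() for o in range(MIN_ORDER, MAX_ORDER + 1)}
    free_lists[MAX_ORDER].add(0)          # start with one free 4 KiB block

    def alloc(size):
        order = max(MIN_ORDER, (size - 1).bit_length())
        # Take the smallest free block that fits, splitting larger ones in half.
        for o in range(order, MAX_ORDER + 1):
            if free_lists[o]:
                offset = free_lists[o].pop()
                while o > order:          # split, keeping the upper half free
                    o -= 1
                    free_lists[o].add(offset + (1 << o))
                return offset, order
        raise MemoryError

    def free(offset, order):
        # Coalesce upward as long as the buddy block is also free.
        while order < MAX_ORDER and (offset ^ (1 << order)) in free_lists[order]:
            free_lists[order].remove(offset ^ (1 << order))
            offset &= ~(1 << order)       # the merged block starts at the lower half
            order += 1
        free_lists[order].add(offset)

Requesting 100 bytes from this toy allocator hands back a 128-byte block, which is exactly the internal fragmentation trade-off mentioned above, while free() merges buddies back together in a handful of constant-time steps.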

Other allocation techniques include bidirectional coalescing with boundary tags, invented by Donald Knuth and published in The Art of Computer Programming: Fundamental Algorithms, Addison-Wesley, 1973. Tags are placed at both ends of each block, recording the size of the block and whether it is allocated (an allocation bit of 0 or 1). This makes coalescing easy: we can check both adjacent blocks, coalesce with either neighbor that is free, and discard the inner tags.
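Here is a toy model of that idea in Python; a real implementation stores the (size, allocated) tag at both ends of each block in the heap itself, while this sketch just keeps the blocks in a list to show the constant-time neighbor checks that the tags make possible. The sizes are made up.

    # Toy model of boundary-tag coalescing: each entry stands for a heap block
    # whose (size, allocated) pair would be stored in tags at both ends, so a
    # real free() can inspect the left neighbor's footer and the right
    # neighbor's header in constant time. Sizes are invented.
    heap = [[64, True], [32, False], [128, True], [256, False]]

    def free(i):
        heap[i][1] = False
        # Right neighbor free (its header tag says so)? Merge it in.
        if i + 1 < len(heap) and not heap[i + 1][1]:
            heap[i][0] += heap.pop(i + 1)[0]
        # Left neighbor free (its footer tag says so)? Merge into it.
        if i > 0 and not heap[i - 1][1]:
            heap[i - 1][0] += heap.pop(i)[0]

    free(2)        # the 128-byte block absorbs its free 32- and 256-byte neighbors
    print(heap)    # [[64, True], [416, False]]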

The Hoard memory allocator is a fast, scalable memory allocator that addresses all three of the problems that multi-threaded allocators face. Hoard uses one global heap and one local heap per thread. Each local heap owns a number of superblocks. Hoard bounds blowup by placing an emptiness threshold on each superblock and moving a superblock from the local heap to the global heap for reuse once the threshold is crossed.

The lock contention problem on local heaps is addressed by ensuring that a superblock is owned by only one local heap at any time and that only one thread is allowed to allocate from a given superblock. When multiple threads allocate memory, their requests are directed to different superblocks, thus avoiding most lock contention scenarios. In the worst case, when one thread of a pair allocates and the other deallocates from the same superblock, there is a 2x slowdown due to serialization of malloc()s and free()s; this is still scalable, as the slowdown does not grow with the number of threads. The lock on the global heap needs to be obtained only in a few cases, such as the initial creation of superblocks, moving superblocks in and out of the global heap, and freeing blocks in superblocks held by the global heap. The paper shows that in a steady state the allocator incurs no lock contention on the global heap, and low contention during gradual growth.

Active false sharing is avoided by directing allocation requests from multiple threads to different superblocks. When a deallocation request comes in, the freed block of memory is returned to its superblock. This strategy prevents multiple threads from sharing a cache line. There is, of course, the case when a superblock's emptiness falls below the threshold, the superblock is transferred to the global heap and later assigned to another local heap, and the two heaps end up sharing cache lines; in practice, the authors of the Hoard allocator observed that this happens very infrequently, and most of the time a superblock is completely empty upon transfer to the global heap.
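To make the ownership and emptiness-threshold rules concrete, here is a toy Python model; the threshold value, block counts, and names are invented, and the real Hoard bookkeeping (per-size-class bins, fullness groups, and so on) is considerably more involved.

    # Toy model of Hoard-style heaps: every superblock is owned by exactly one
    # heap, a thread allocates only from its local heap's superblocks, and a
    # superblock that becomes empty enough is donated to the global heap for
    # reuse. The threshold, sizes, and names are invented for illustration.
    EMPTINESS_THRESHOLD = 0.25   # fraction of blocks in use below which we donate

    class Superblock:
        def __init__(self, nblocks=64):
            self.nblocks = nblocks
            self.used = 0
        def fullness(self):
            return self.used / self.nblocks

    class Heap:
        def __init__(self):
            self.superblocks = []

    global_heap = Heap()

    def malloc(local_heap):
        sb = next((s for s in local_heap.superblocks if s.used < s.nblocks), None)
        if sb is None:
            # Reuse a superblock from the global heap if one is available,
            # otherwise create a fresh one; either way the local heap owns it.
            sb = global_heap.superblocks.pop() if global_heap.superblocks else Superblock()
            local_heap.superblocks.append(sb)
        sb.used += 1
        return sb

    def free(local_heap, sb):
        sb.used -= 1
        # Crossing the emptiness threshold moves the superblock to the global
        # heap, which is what bounds blowup and lets other threads reuse it.
        if sb.fullness() < EMPTINESS_THRESHOLD and len(local_heap.superblocks) > 1:
            local_heap.superblocks.remove(sb)
            global_heap.superblocks.append(sb)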

JEmalloc

Design

Source code

Coming soon...

Thesis

Loom and Graphs in Clojure

Talk at LispNYC on August 13th, 2013