Finding Great (and Profitable) Ideas in the Computer Science Literature

I spend quite a bit of time trawling through recent computer science papers, looking for anything algorithmic that might improve my team’s product and Help People Get Jobs. It’s been a mixed bag so far, often turning up a bunch of pretty math that won’t scale at Indeed. But looking through the computer science literature can pay off big, and more of us should use the research to up our game as software developers.

[Word cloud generated by WordItOut]

Why read a computer science paper

The first question you might ask is why? Most working developers, after all, simply never read any computer science papers. Many smart developers look at me blankly when I even suggest that they do a literature search. “You mean look on StackOverflow?”

The short answer: to get an edge on your problem (and occasionally on your competition or your peers).

Some academic is looking into some deep generalization of whatever problem you are facing. They are hungry (sometimes literally, on academic salaries) to solve problems, and they give away the solutions. They are publishing papers at a ferocious pace, because otherwise their tenure committees will invite them to explore exciting opportunities elsewhere. Academics think up good, implementable approaches and give them away for free. And hardly anyone notices or cares, which is madness. But a smart developer can sometimes leverage this madness for big payouts. The key is knowing how to find and read what academics write.

Finding computer science papers

Thousands of computer science papers are published each year. How do you find a computer science paper worth reading? As with so many questions in this new century, the answer is Google, specifically Google Scholar.

As near as I can tell, Google Scholar includes almost all the academic papers ever written, for free. Almost every computer science paper since Alan Turing is accessible there. With Scholar, Google is providing one of the most amazing resources anyone has ever given away. Some links point to papers behind paywalls, but almost all those have extra links to copies that aren’t. I’ve read hundreds of papers and never paid for one.

Google doesn’t even attempt to monetize it. Nobody in the general public has heard about scholar.google.com. More surprisingly: according to my Google contacts, not many Googlers have heard about it either.

With Google Scholar, you’ve solved the problem of finding interesting papers.

Filtering computer science papers

Next, the problem is filtering and prioritizing the interesting papers you find.

Google Scholar’s search algorithms are powerful, but they aren’t magic. Even your best search skills will net you too many papers to read and understand. The chance that you are reading the one that will most help your work is small.

Here’s my basic strategy for quickly finding the best ones.

First, figure out the paper’s publication date. This seems like an obvious bit of metadata, but you’ll rarely find the date on the paper itself. Instead, look for clues in Google Scholar; failing that, assume the paper appeared about two years after the latest paper listed in its citations. This seems sloppy, but it’s effective. Computer science papers older than fifteen years are unlikely to contain anything of value beyond historical interest.

Next, read the first paragraph of the paper. This paragraph covers the problem the researchers are trying to solve, and why it’s important. If that problem sounds like yours, score! Otherwise, unless the authors have hooked you on the intrinsic interest of their results, dump it and move on to the next paper.

If things still seem promising, read the second paragraph. This paragraph covers what the authors did, describes some constraints, and lets you know the results (in broad strokes). If you can replicate what they did in your environment, accept the constraints, and the results are positive, awesome. You’ve determined the paper is worth reading!

How to read a computer science paper

The biggest trick to reading an academic paper is to know what to read and what not to read. Academic papers follow a structure only slightly more flexible than that of a sonnet. Some portions that look like they would help you understand will likely only confuse. Others that look pointless or opaque can hold the secrets to interpreting the paper’s deeper meanings.

Here’s how I like to do it.

Don’t read the abstract. The abstract conveys the gist of the paper to other researchers in the field. These are folks who’ve spent the last decade thinking about similar problems. You’re not there yet. The abstract will likely confuse you and possibly frighten you, but won’t help you understand the topic.

Don’t read the keywords. Adding keywords to papers was a bad idea that nonetheless seems to have stuck. Keywords tend to mislead and won’t add anything you wouldn’t get otherwise. Skip ‘em, they’re not worth their feed.

Read the body of the paper closely. Do you remember the research techniques your teachers tried to drum into you in eighth grade? You’ll need them all. You’re trying to reverse engineer just what the researchers did and how they did it. This can be tricky. Papers tend to leave out many shared assumptions behind the research, as well as many details and small missteps. Read every word. Look up phrases or words you don’t know — Wikipedia is usually fine for this. Write down questions. Try to figure out not just what the researchers did, but what they didn’t do, and why.

Don’t read the code. This is counterintuitive, because the clearest way software developers communicate is through code — ideally with documentation, revision history, cross-references, test cases, and review comments.

It doesn’t work that way with academics. To a first approximation, code in academic papers is worthless. The skills necessary to code well are either orthogonal to or actively opposed to the skills necessary for interesting academic research. It’s a minor scandal that most code used in academia is unreviewed, not version-controlled, lacks any test cases, and is debugged only to the point of “it didn’t crash, mostly, today.” That’s the good stuff. The bad stuff is simply unavailable, and quite probably long-deleted by the time the paper got published. Yes, that’s atrocious. Yes, even in computer science.

Read the equations. Academics get mathematics, so their equations have all the virtues that software developers associate with the best software: precision, correctness, conciseness, evocativeness. Teams of smart people trying to find flaws offer painstaking reviews of the equations. In contrast, a bored grad student writes the code, which nobody reads.

Don’t read the conclusions section. It adds nothing.

Leveraging a computer science paper for further search

Academic papers offer a bounty of contextual data in their references to other papers. Google Scholar excels at finding papers, but there’s no substitute for actually following the papers that researchers used to inform their work.

Follow the citations in the related work. Authors put evocative descriptions of the work that matters to them in “Related Work.” This provides an interesting contrast for interpreting their work. In some ways, this section memorializes the most important social aspects of academic work.

Follow the citations in the references. Long before HTML popularized hypertext, academic papers formed a dense thicket of cross-references, reified as citations. For even the best papers, half of the value is the contents, half is the links. Citations in papers aren’t clickable (yet), but following them is not hard with Google Scholar.

Repeated citations of older papers? There’s a good chance those are important in the field and useful for context. Repeated citations of new papers? Those papers give insight into the trajectory of the subject. Odd sounding papers with unclear connections to the subject? They are great for getting the sort of mental distance that can be useful in hypothesis generation.

Once you’ve done all that…

It’s just a simple matter of coding. Get to it!

Dave Griffith is a software engineer at Indeed and has been building software systems for over 20 years.


Vectorized VByte Decoding: High Performance Vector Instructions

Data-driven organizations like Indeed need great tools. We built Imhotep, our interactive data analytics platform (released last year), to manage the parallel execution of queries. To balance memory efficiency and performance in Imhotep, we developed a technique called vectorized variable-byte (VByte) decoding.

VByte with differential decoding

Many applications use VByte with differential encoding to compress sorted sequences of integers; it is the most common style of compression for inverted indexes. Instead of encoding the integers themselves, this approach encodes the successive differences between them, and VByte then uses fewer bytes for the smaller values at the cost of more bytes for larger ones.
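
To make that concrete, here is a minimal sketch of delta-plus-VByte encoding in Java. It is only an illustration of the format (the class and method names are made up, and it is not Imhotep’s implementation):

import java.io.ByteArrayOutputStream;

public final class VByteDeltaEncoder {
    // Encodes a sorted sequence of non-negative longs. Each value is stored as
    // the difference from its predecessor, and each difference is written as a
    // variable-length run of bytes: the low 7 bits carry data, and the high bit
    // is a continuation flag set on every byte except the last.
    public static byte[] encode(long[] sortedValues) {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        long previous = 0;
        for (long value : sortedValues) {
            long delta = value - previous;               // small gaps need few bytes
            previous = value;
            while (delta >= 0x80) {
                out.write((int) (delta & 0x7F) | 0x80);  // continuation bit set
                delta >>>= 7;
            }
            out.write((int) delta);                      // final byte, high bit clear
        }
        return out.toByteArray();
    }
}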

A conventional VByte decoder examines only one byte at a time, which limits throughput. Each input byte also requires a branch, and the unpredictable lengths of the encoded integers cause frequent branch mispredictions.
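
For contrast, a conventional scalar decoder for the format above might look like the following sketch (again illustrative, not Imhotep’s code). The continuation-bit test in the inner loop is the branch taken once per input byte:

public final class VByteDeltaDecoder {
    // Decodes `count` integers from a delta-VByte stream like the one produced
    // by the encoder above. The loop condition on the continuation bit is the
    // hard-to-predict, once-per-byte branch that limits throughput.
    public static long[] decode(byte[] in, int count) {
        final long[] out = new long[count];
        long previous = 0;
        int pos = 0;
        for (int i = 0; i < count; i++) {
            long delta = 0;
            int shift = 0;
            byte b;
            do {
                b = in[pos++];
                delta |= (long) (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);   // one branch per input byte
            previous += delta;           // undo the differential encoding
            out[i] = previous;
        }
        return out;
    }
}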

Vectorized VByte decoding

Our masked VByte decoder processes larger chunks of input data — 12 bytes — at one time, which is much faster than decoding one byte at a time. This is important for Indeed because Imhotep spends ~40% of its CPU time decoding variable-byte integers. We described this approach in a tech talk last year: Large Scale Analytics and Machine Learning at Indeed.

Jeff Plaisance (Indeed), Nathan Kurz (Verse Communications), and Daniel Lemire (LICEF, Université du Québec) discuss the masked VByte decoder in detail in Vectorized VByte Decoding. The paper’s abstract follows:

We consider the ubiquitous technique of VByte compression, which represents each integer as a variable length sequence of bytes. The low 7 bits of each byte encode a portion of the integer, and the high bit of each byte is reserved as a continuation flag. This flag is set to 1 for all bytes except the last, and the decoding of each integer is complete when a byte with a high bit of 0 is encountered. VByte decoding can be a performance bottleneck especially when the unpredictable lengths of the encoded integers cause frequent branch mispredictions. Previous attempts to accelerate VByte decoding using SIMD vector instructions have been disappointing, prodding search engines such as Google to use more complicated but faster-to-decode formats for performance-critical code. Our decoder (MASKED VBYTE) is 2 to 4 times faster than a conventional scalar VByte decoder, making the format once again competitive with regard to speed.

Vectorized VByte Decoding has been accepted to the International Symposium on Web Algorithms (iSWAG) on June 2-3, 2015. iSWAG promotes academic and industrial research on all topics related to web algorithms.

Large-scale interactive tools

To learn more about Imhotep, check out these tech talks and slides: Scaling Decision Trees and Large-Scale Analytics with Imhotep. You can find the source and documentation for Imhotep on GitHub.


Memory Mapping with util-mmap

We are excited to highlight the open-source availability of util-mmap, a memory mapping library for Java. It provides an efficient mechanism for accessing large files. Our analytics platform Imhotep (released last year) uses it for managing data access.

Why use memory mapping?

Our backend services handle large on-disk data structures, like LSM trees and Lucene indexes. The util-mmap library provides safe memory mapping of these kinds of large files. It also overcomes known limitations of MappedByteBuffer in the JDK.

Memory mapping is the process of bringing part of a file into a virtual memory segment. Applications can then treat the mapped part like primary memory. We use memory mapping in latency-sensitive production applications that have particularly large files. By doing so, we prevent expensive I/O operations.
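
For reference, mapping a file with nothing but the JDK looks roughly like the fragment below (illustrative only; the path is made up and the usual java.nio imports are assumed). It uses MappedByteBuffer, whose limitations are discussed next:

try (FileChannel channel = FileChannel.open(
        Paths.get("/tmp/data.bin"), StandardOpenOption.READ)) {
    // Map the whole file read-only; note that map() rejects sizes over 2GB.
    final MappedByteBuffer mapped =
            channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
    // Reads are served by page faults rather than read() system calls.
    final byte first = mapped.get(0);
}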

Limitations with MappedByteBuffer

The JDK provides MappedByteBuffer in the java.nio package for memory mapping. This class has three main problems:

Unable to safely unmap
The only way to request unmapping with MappedByteBuffer is to call System.gc(). This approach doesn’t guarantee unmapping and is a known bug. You must unmap a memory mapped file before you can delete it. This bug will cause disk space problems when mapping large, frequently-updated files.

Unable to map files larger than 2GB
MappedByteBuffer uses integers for all indexes. That means you must use multiple buffers to manage files that are larger than 2GB. Managing multiple buffers can lead to complicated, error-prone code.
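
As a sketch of why that becomes error-prone (illustrative names and chunk size only, not code from util-mmap or Imhotep), consider reading a long at an absolute offset from a file that had to be mapped as several buffers:

private static final long CHUNK_SIZE = 1L << 30;  // 1GB per buffer, under the 2GB limit

// Read a long at an absolute byte offset from a file mapped as multiple chunks.
static long readLong(MappedByteBuffer[] chunks, long byteOffset) {
    final int chunk = (int) (byteOffset / CHUNK_SIZE);
    final int offsetInChunk = (int) (byteOffset % CHUNK_SIZE);
    // Breaks when the 8 bytes straddle a chunk boundary: exactly the kind of
    // edge case this bookkeeping forces you to handle yourself.
    return chunks[chunk].getLong(offsetInChunk);
}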

Thread safety
ByteBuffer maintains internal state to track the position and limit. Reading using relative methods like get() requires a unique buffer per thread via duplicate(). Example:

public class ByteBufferThreadLocal extends ThreadLocal<ByteBuffer>
{
    private final ByteBuffer src;

    public ByteBufferThreadLocal(ByteBuffer src)
    {
        this.src = src;
    }

    @Override
    protected synchronized ByteBuffer initialValue()
    {
        // Give each thread its own duplicate, with an independent position and limit.
        return src.duplicate();
    }
}

Memory mapping with util-mmap

util-mmap addresses all of these issues:

  • implements unmapping so that you can delete unused files immediately;
  • uses long pointers, so it is capable of memory mapping files larger than 2GB;
  • works well with our AtomicSharedReference for safe, simple access from multiple threads.

Example: memory mapping a large long[] array

Use Guava’s LittleEndianDataOutputStream to write out a binary file:

try (LittleEndianDataOutputStream out =
        new LittleEndianDataOutputStream(new FileOutputStream(filePath))) {
    for (long value : contents) {
        out.writeLong(value);
    }
}

Use MMapBuffer to memory map this file:

final MMapBuffer buffer = new MMapBuffer(
       filePath,
       FileChannel.MapMode.READ_ONLY,
       ByteOrder.LITTLE_ENDIAN);
final LongArray longArray =
    buffer.memory().longArray(0, buffer.memory().length() / 8);

Why not use Java serialization?
Java serialization writes data in big-endian form, while Indeed’s production systems run on Intel processors, which are little-endian. Also, in a serialized long array, the actual data starts 17 bytes into the file, after the object header.

To memory map a natively serialized Java array, you would have to write code to manage that offset correctly, and you would have to flip the byte order of every value you read, which is expensive. Writing the data in little-endian form results in much more straightforward memory mapping code.

Thread Safety

For safe access from multiple threads, use AtomicSharedReference. This class wraps the Java object that’s using the memory mapped file. For example:

final AtomicSharedReference<LongArray> objRef =
    AtomicSharedReference.create(longArray);

The objRef variable is a mutable reference to the underlying SharedReference, a ref-counted object. When using the array, you must call getCopy() and then close the reference.

try (final SharedReference<LongArray> myData = objRef.getCopy()) {
    LongArray obj = myData.get();
    // … do something …
}

SharedReference keeps track of references and unmaps the file when none are still open.

Reloads

Use the setQuietly method to swap in a newer copy of the file after reloading it from disk.

final MyObject newMyObj = reloadMyObjectFromDisk();
objRef.setQuietly(newMyObj);

Close

Use closeQuietly upon application shutdown to unmap the file.

objRef.closeQuietly();

Get started with util-mmap

At Indeed, we use util-mmap in several production services. We are using it to access files that are up to 15 GB and updated every few minutes. If you need to memory map your large files, visit us on GitHub and give util-mmap a try.
