Can Machines Color in 3D

Texturing 3D Objects using Style Transfer and Generative Adversarial Networks

27 min readNov 18, 2017

Introduction

Let us say you give a toddler an outline of shapes from a coloring book, and this is the first time the toddler has ever seen a coloring book. You show the toddler a picture of the image they are supposed to create as well. Let us say the drawing they are supposed to color and the image they are supposed to create are the following:

This is what the toddler would probably create the first time.

However, the toddler shows this picture to an adult. The adult tells the toddler that the color should be red. The toddler takes this feedback and tries again.

The toddler gets feedback again from the adult. This time the adult tells her that the leaf has to be green and she needs to color within the lines.

However, it still is not quite right. She asks the adult again for feedback, and using the feedback creates this.

Eventually, with enough iterations the toddler will eventually be able to create an rendering that looks like the image she is supposed to match.

The process that the toddler went through is how a lot of deep learning works and gives us an analogy of the problem we tried to tackle as a part of the University of Southern California’s experimental deep learning course [1].

Now, let us make the toddler and adult neural networks, that form a model. In the problem we are trying to solve, the toddler is now being asked to color an object in 3D given a 2D image.

A clear and concise statement of the problem we explored is the following:

In this report, we share the results of our attempt to harness deep learning techniques to texture 3D objects using style transfer and Generative Adversarial Networks (GANs).

This problem has a lot of applications and a lot of potential. To motivate you, before you go down the rabbit hole, you may want to click Why is this Cool (or you may wait until you reach the conclusion to better understand some of the things explained in this section).

We tried out three different techniques to texture 3D objects. Our first and second methods were inspired by style transfer [2]. Third one was inspired by conditional GANs [3].

Method 1: Height Maps and CNN for Style Transfer

When we first started this project, we did a lot of research on style transfer (if you are unfamiliar with style transfer, here is a paper on it)[3]. But a summary of what it does is that it applies a style of one image to the content of another.

A is the content image, the small middle image is the style, B is the result [3]

We started out by reading through an implementation of style transfer that we found online [4]. We then ran style transfer on various images and tweaked parameters to better understand the way it works.

This problem seems like it could be solved using style transfer because we could treat the geometry as content and texture as style. However, in order to directly apply this technique we would need a way to represent the content in a 2D image. The easiest way to do that is through a height map.

A height map is a gray scale image image of a 3D model. The value of each pixel represents how high that pixel is in the 3D model. White (a value of 255) corresponds to high elevations, while black (a value of 0) corresponds to low elevations. For example, this image:

Maps to this image:

As you can see in this image, the darker values are lower and brighter higher

So we decided to apply style transfer to the height map of a 3D model that represented a landscape (the more technically correct term is “terrain”).

The result does have some noise due to the flowers in the style image, yet it also seems to have a distinctive pattern (darker tones off the style in the lower elevation/or dark parts of the height map).

In order to truly understand the effectiveness of this method, we would need to create a 3D visualization using the height map and texture. Using a visualization library called vispy, we managed to generate this visualization. The results are below:

**A, C, D** — Images of Final Result, B — Using height map image as texture for comparison

The texture does seem to work well from a bird’s eye view, but at certain angles it really needs an improvement. A hypothesis for the weak results at some angles is that color information is not available on the y axis (elevation axis in graphics). There is no variation of color on the y axis because the y axis is represented by color itself in the content image. This makes it reasonable to test out the next method, in which local information on the y-axis can be used. Additionally, height maps are one to one functions. Here is what a height map function looks like:

Each x and z pair can only map to one value. However, in many 3D objects, for example, a sphere, each x and z pair does not map to only one value.

Method 2 : Coloring voxels and CNN for Style Transfer

What are Voxels

Before we get into method 2, we want to provide a clear definition of “voxels” the unit we use to represent our 3D mesh in this approach.

Let us say you are given a model like this:

You can split the 3D model into a 3 dimensional grid like this:

The grid splits the space into a lot of cubes. There are certain cubes that have parts of the object in them, and there are other cubes that don’t. For example the green cube in the image below has the model in it, while the red cube does not.

The cubes that do have the object in them are labeled 1s. And the cubes that don’t have the model in them are labeled 0s. Thus, it is a 3D bit mask. Here’s a visualization of the 3D bit mask showing all the ones:

Each cube can also be called a voxel.

The Canonical Positions

The intuition behind method 2 is to apply Style Transfer on the different 2D views of a 3D model using a style image. In order to execute on this intuition, we have to figure out 2D views to use.

Our group decided to use views called the “canonical positions.”

If you treat the 3d space as a cube, the canonical positions are pictures of the object taken from the 6 sides of the cube.

Isometric view followed by the 6 canonical views

The Process

Now we’ll walk through the process that our group took for method 2. We first have to decide how we would like to represent the 3D models in a way that can be fed into the network. Among available options are 3D meshes and voxels. It is smarter to go with voxels because it is easier to feed into a convolutional network, because the representation is similar to an image.

After deciding on voxels, we need to find the canonical views of the voxels. Since voxels do not contain color information and style transfer is used on RGB images, we need to initialize the color of each canonical view.

We can start with initializing the foreground as white and the background as black. Then each of the 6 views can be fed into the same style transfer network described in Method 1 to get 6 “mixed image”results. After that, we can map the 6 “mixed image” results back into 3D space based on the voxels the pixels in content image are based on, resulting in a colored 3D model.

Since the content image’s features play a significant role while transferring style, we want to find the best color initialization of the content images.

To do so, our group came up with different approaches for feeding the content image into the network. Now, we’ll dive into each one.

Separate Views

This is the simplest approach, and was our group’s baseline for applying Style Transfer on views of a 3D object. We treat each view as a separate content image and applied style transfer on it. The results from this method are pretty decent as seen below.

Style Image (same for all other results for method 2 shown)

Views Grid

Moving on, we want to see if changing the way input is fed to the style transfer algorithm could improve the results. Therefore, we decide to combine the images in a grid-like manner as shown below before applying style transfer. The results for this method turned out to be darker than when using individual images separately. This is probably due to the grid image having a black background.

As seen in the output above, the results do not seem to be able to recognize and map certain colors to certain features, likely because the initial 2D projections do not have visible features due to it being a single solid color. We then proceed to try out different color initialization techniques for the 2D projections.

XYZ Coordinates as RGB

This approach is pretty straightforward and as its name suggests it just uses the X, Y, and Z coordinate values as R, G, and B respectively.

Voxel Densities of Different Window Sizes as RGB

This approach is the most interesting of the three we are going to use to initialize the colors. The idea behind it is to capture the distribution of voxels around the voxel being colored. By capturing the density of voxels in different 3D cubes around the voxel we are trying to color, we can determine whether that voxel is in a corner or in the middle of the object. In order to represent densities of the voxels, we assign each channel (R, G, B) a different cube size. We then take an average of the voxels (which are represented with 1 and 0 values) present in each of these cubes.

Representation of the voxel density using 3 windows for RGB

In the above figure, the blue channel is the average of the voxels in the (biggest) blue cube, green channel is average of the voxels in the green cube, and red is the average of the voxels in the red cube. We thus have the color of the corresponding voxel (grey cube) representing the distribution of voxels around it.

Below are the views our group generated using 4 different combinations of window sizes.

Content views using voxel densities of different window sizes as RGB

Below are the results from using the above views as content for Style Transfer.

Window sizes: Red: 1, Green: 9, Blue: 25

Window sizes: Red: 3, Green: 9, Blue: 25

From the above results, we see that the output mesh for running style transfer separately on each canonical view better resembles the style as compared to using a grid for the same window sizes. Also, capturing the distribution of voxels around voxels in the color initialization correlates local areas of the input (in the grid view as well as separately) and thus does give better results as we hypothesized.

Method 3

Another completely different way of approaching this problem, would be through a Conditional GAN(cGAN)[3]. In order to understand why this would work, let us take a look at what we defined the problem to be again.

As a brief reminder, in the formula above, the “textured object” is a colored mesh (or 3D object). The “geometry” is the 3D object without any color. The “texture” is an image that we want the network to color the 3D object with.

Speaking in the context of a cGAN, the network would generate a textured object, conditioned on the geometry (3d data) and the texture (2d image).

With previous techniques, we used the pre-trained parameters from the VGG-19 network, as style transfer is based on this network. However, for a cGAN approach, we would need to train the network on an actual data set.

Data Collection

The data set our group used for this project is from ShapeNet [5]. ShapeNet is a research effort to provide large data sets of 3D models. ShapeNet has two data sets available, ShapeNetSem and ShapeNetCore. ShapeNetSem is a bit smaller and comes with slightly different information. Our group decided to go ahead and use ShapeNetSem because ShapeNetCore would be two large for us to download (it is 30.1 GB!). ShapeNetSem comes with 12,000 models, 3354 of which our group used (due to incomplete data and non-orthogonal (not (1,0,0), (0,1,0), or (0, 0, 1))) up and front vectors.

Each 3D model in the ShapeNetSem data set comes with a lot of extra information associated with that 3d model. For our project, the extra information or “fields” that were important were:

The 3D models themselves in different formats. More explanation on what this file format is in a bit.
2D images of different viewpoints of a rendering of the 3D model.
Text information that identified an ID of the model, the “up vector,” and “front vector.”

Here’s a diagram of all the data that we need for one example model:

Here’s an explanation in more detail of what these different fields are and how our group preprocessed them so that we could end up using them.

Binvox

The binvox file format stores a 3D bit mask or the voxels of a 3D model.

ShapeNet’s binvox files stored a 128*128*128 grid. For computational efficiency we could to downsize this to a 32*32*32 grid. Here’s how our group did it:

After that we should center the voxels because a centered mesh will become useful later when we determine the color of these voxels from 2D images:

And then we can put it all together by reading the .binvox file and calling the two functions defined above:

We save this data for the voxel meshes in a number_of_models×32×32×32×1 matrix of boolean values.

Images

Each model comes with 14 images of a rendering of that model. The first 6 images are from the canonical positions. The last 8 images are the turn table images of the model. They are 8 images taken at a downward 45 degree angle where the camera is above the object like this:

Here is an example of the turn table positions of a rendering of a model of a lamp:

And here are all the images, canonical and turn table of this lamp:

The reason we need these images is because the network needs a colored version of the 3D model. Unfortunately, ShapeNet does not come with colored voxels (where each voxel stores an RGB value — essentially a 3D image). And unfortunately there isn’t any simple way to generate one. In fact this may have been the biggest hinderance to our project. Our group tried a few things to get this data before settling on a final approach.

The first technique our group tried out is called the voxel coloring algorithm [6]. Given the turn table images of a rendering of the model, using this algorithm you could color all the voxels visible to a turntable camera by project the colors onto the 3D model’s voxels. A visualization of this camera was shown previously, but here it is again:

However, in order to use this approach, you would have to know the angle these images are taken at. Unfortunately, ShapeNet doesn’t come with this information, and this could be computationally inefficient for a large data set. Therefore, we did not end up implementing this method.

The next option our group tried was to voxelize a colored version of the 3D model. ShapeNet comes with colored 3D models like this:

But not with colored voxel volumes like this:

Our group looked for tools that could generate colored voxels from textured 3D models. We found a few apps like CubikStudio (used to make the images above) and MagicaVoxel, but we’d have to manually voxelize each one using a GUI. And manually going through the mouse clicks to generate colored voxels on a dataset of over 6000 colored models is a huge waste of time. Therefore, our group did not use this method.

Our group did find one script that voxelizes 3D models with color [7]. Unfortunately, however this did not work on most of ShapeNet’s 3D models because the script did not support for formats of most of the .mtl file (the file used to color 3D models if they are not voxels).

We searched all over and asked on multiple forums (Stack Overflow, Cubik Studio, and even on the ShapeNet Forum).

Someone who was a part of the ShapeNet research effort finally got back to us with a response that would be super useful if the project was due one week later:

However, for now we have to settle on a different approach for obtaining the colored voxels.

Our group ended up projecting the canonical images onto the 6 sides of the object to get colored voxels, like this:

Although we managed to finally get this technique working there were still a lot of issues with it. One issue with this technique was concavity. For example, in the 3D model of the mug, we can see the bottom of the mug from the top view, but we cannot see the parts of the mug pointed to by the arrows in the image below (and this is also a problem in method 2 since it also uses the canonical positions).

Additionally, the images from the canonical positions are not all taken at the same distance from the object. For example this is the front view of one object.

And this is the bottom view for that same object:

Nonetheless, as this was the most effective technique our group had for the time given, so we decided to use it.

We saved the colored voxel representation of the 3D models in a matrix of dimension number of models×32×32×32×3.

We also needed one of the images that come with the model to represent the texture that we used. So we saved one of the images (the front view to be precise) for each model in a number of models×64×64×3 matrix.

Vectors and ID

ShapeNet comes with a csv file, or a table of model information containing the text information about the models.

The information in each row of the table contain the model id, up vector, and the front vector.

The up vector is the direction of up in a the 3d space that contains the model. And the front vector is the direction the front of the model is pointing towards. For example in the model below the green arrow is the up vector and the blue arrow is the front vector.

We thought these vectors were important because it would allow us to project the 2D images from the canonical positions correctly. Unfortunately, different projections were needed for the same up vector and front vector, so we ended up not using this information (and adding another flaw to the technique described for coloring the voxels above). Nonetheless, we could try to build on this technique if we were to revisit the project (although it’s probably smarter to just use the colored voxel meshes ShapeNet is going to release a few days after I finish this post!).

We extracted from the csv file an array of model ids, array of up vectors, and array of down vectors. We did this using Pandas, a library meant to work well with importing large csv files, through the method below:

Review of the Data Collection Process

There were clearly many flaws in how we created the colored voxels. But at the end of the day, we got the data in a format our network could read. Additionally, through resizing the data and then compressing it, we managed to reduce the size of the data from over 5 GB to 33.2 MB. This is an over 150× decrease in size. Here is a link to the entire notebook we used for preprocessing if you were curious. And here is an overview of all the data we had after preprocessing at our disposal:

Enough about the preprocessing, which is not even actual deep learning, now onto the network!

Our Network

Introduction

We made a lot of pivots on figuring out the right network to use. We are only going to explain the methods we implemented, but linked is an unfiltered journal, on how the decision was finally made to implement the versions of the network described in this section.

We decided to use an implementation of a cGAN we found online implemented by a researcher named Takumi Moriya for our baseline. Some credit for this project goes to him, as without his work we would have not have made a good start for method 3. The code also comes with a paper (but unfortunately it is in Japanese) [8][9].

We modified the implementation we downloaded by removing parts of it unnecessary to our project. Here’s a whiteboard illustration of what we did:

Right now the illustration above probably does not make any sense, but I’ll break it down into simple pieces.

The Model’s Details

The implementation found online is based on a paper called StackGAN [10]. In the last entry of our daily journal, there is a brief description of the paper and how it applies to this project. A simple description of what a stack is though is just a conditional GAN that is conditioned on the output of another conditional GAN. Like this:

In the base implementation, GAN A generated 3D geometry and GAN B generated the colors for the geometry. However, since we already are given the geometry for the problem we are trying solve, we can remove GAN A.

So now we will dive into how GAN B works.

Our GAN is conditioned on the geometry and the colored image. We got this data through preprocessing. The colored image was a 64 ✕ 64 ✕ 3 matrix and geometry was represented with a 32 ✕ 32 ✕ 32 ✕ 3 binary voxel mesh.

We first encode the colors of the image data, or make a more compact representation. The purpose of encoding the colors of the image data is similar to the purpose of word2vec (a technique to make a vector representation of words) [11]. We used convolutional and fully connected layers in this part of the network.

The color encoder outputs an encoding for each color channel.

Afterwards we pass the encoding of each color channel along with the binary voxel mesh to the generator.

The generator also takes random noise as an input.

However, for the rest of the diagrams I won’t include it, and take for granted that it is assumed knowledge that a GAN’s generator should take as input random noise or a “latent code”. We used convolutional, convolutional transpose, and fully connected layers to “upsample” and generate the colored voxels.

The image encoding and the results from the generator are passed into the discriminator during train time. And the result is created by the generator during test time. Convolutional and fully connected layers, batch normalization, and Leaky ReLU were used for the discriminator.

You might be wondering what the edge encoder on the bottom left is. The edge encoder embeds the colors and shapes over the entire image instead of separating into RGB. It was also used in the original StackGAN version that Takumi Moriya implemented to encode black lines (or in the image below gray dotted lines) that were edges or boundaries between shapes in the input image, because the network was originally trained on 2D images like this:

We did end up training and testing a network with this encoder.

However, we also thought that it might be better to feed the results of the color encoder into the discriminator instead of the results of the edge encoder. The reason this made more sense to us is because the discriminator can make it’s judgement from the same exact input given to the generator rather than an input that requires a non-linear transformation to be the same. Thus, we implemented our own custom discriminator that took in the output from the color encoder. Here’s the code for that discriminator below:

The dimensions for most of the layers in the custom discriminator were determined by reversing the code in the generator.

With our custom discriminator, the network looks like this:

And there you have it! Here’s a gif of the whole thing:

Loss Function

We used to standard loss function for a GAN:

Here’s a visualization of the loss functions with the structure of the networks included, that was generated using TensorFlow’s tool for visualizing networks, tensorboard:

You could find all the code for network at this link.

Training

We wrote our own code to train the network. We used Adam’s Optimization (you can check out an exploration of Adam’s Optimization we wrote here) and decayed the learning rate by a factor of 0.002 every 9300 steps. We also set staircase to False, so the decay was interpolated over each step because it would be impractical in the time we had to train our networks for 9300 steps or more.

Here’s the training code that shows in more detail how we did it:

We did not make train, test splits because our data was limited (3553 models), so we risk showing results that overfit the dataset; however, we also show results that qualitatively and quantitatively are better.

Results

We ended up with 3 different GANs that we each tested.

(Model A) The first GAN was Takumi Moriya’s implementation (the network described above with the edge encoder) with his pre-trained parameters.
(Model B) The second GAN was Takumi Moriya’s implementation that we trained on the processed data we got from the ShapeNetSem dataset. A batch size of 64 was used to train this network. We used 1 hour to train 10 epochs on a Google Cloud VM instance with an 1 x NVIDIA Tesla K80 GPU (12 GB of memory).
(Model C) In the third GAN, we implemented our own discriminator and trained the entire network on the processed data we got from the ShapeNetSem dataset. A batch size of 32 was used for this network. It took about 2 hours to train 10 epochs on the same virtual machine used to train model B.

To visualize our results we wrote a function that plots the expected results and generated results for a batch of samples in a 3d grid. The code below shows how it works:

For each cell in the visualizations I’ll show below the top plot is the generated results, and the bottom plot is the expected results.

For a quantitative metric that measures the similarity between the expected and generated results, we defined the following L₂ reconstruction loss function:

We then found the mean of the all the elements in the resulting matrix. We ran this function on 10 batches of 16 samples and averaged the losses to produce a final number. A potential extension of this metric that we did not evaluate on is KL Divergence or reconstruction loss on intermediate layers of a convolutional network that downsample the image. This extension would be valid because instead of measuring the similarity in color, such metrics would also measure the similarity in the distribution of color.

Now, I’ll move on to showing our results.

Model A Results

Reconstruction Loss: 0.029293

As you can probably tell in the picture above, the results for Model A were not great. It could only color the model with yellow (R: 246, G: 251, B: 83) and green (R: 116, G: 248, B: 75). Additionally, there didn’t seem to be any pattern to the boundaries between yellow and green.

Although we were hopeful Model A might give us good baseline results because we were using someone else’s pre-trained network, upon further analysis it became clearer to us why it did not. There was one giant issue we noticed with this approach.

The pre-trained parameters were based on a training dataset of textures that had colors like this:

And results like this:

The distribution of colors in the dataset the pre-trained parameters were based on looked something like this:

While with the pictures in the ShapeNetSem dataset, since the images more closely represented natural images, the colors would be normally distributed:

This could explain the reason why the pre-trained parameters were biased to only two colors rather than a more even distribution. The parameters of the generator were trained to create results that only have a few colors.

An additional flaw with the parameters could be that they were trained on 3D models from the ShapeNetCore data set rather than the ShapeNetSem data set. This could explain why there seemed to be no pattern with the boundaries in between for the generated results, as a useful pattern would be dependent on training on the correct geometry.

Model B Results

Reconstruction Loss: 0.028027

Model B gave us better results. The colors in the generated results were more evenly distributed (although not evenly distributed enough), which points the possibility of there being an issue, relative to the problem we are trying to solve, with the structure of the layers in the network itself.

Through the visualizations, we found that the network did manage to identify boundaries between the colors.

Examples of boundaries that seemed to have been learnt are identified with red lines in the images below:

Model C

Reconstruction Loss: 0.028582

The results for Model C seemed to be more biased than the first two models towards darker colors as opposed to brighter RGB values. It also seemed to have been able to identify boundaries between colors like Model B did. However, it consistently did not perform as well as Model B on the quantitative metric, suggesting that changing the discriminator to take the input from the output of the color encoder instead of edge encoder does not yield an improvement in results.

Evaluation of This Method

Model B seems to have given the most promising result of the three. The network implemented by Takumi Moriya did seem to better than our own intuitions of how to improve the network that we implemented in Model C, and the results of Model A shows that networks usually work better when the data that was used to train the network is similar to what was used to test the network.

Although we did manage to get this technique to show signs of potential learn something, there were certainly a lot of flaws in it. Some areas we could immediately identify that could significantly improve the results of this method include training on a dataset with colored voxels and more carefully analyzing the layers of each component of the network to better understand how to modify them to work better for texturing 3D objects.

Conclusion

Style transfer on height maps seems to have generated the most appealing visual results:

However, it only works on 3D objects where each pair of x and z coordinates map to one y value. Method 2 also produced visually appealing results, while Method 3, although we surprised ourselves by actually getting our models to learn something, has a lot of room for improvement.

Why is this Cool?

Short answer: Because probably no one has done this before.

We learnt through this project that mixing 3 dimensions, color, and deep learning is really, really hard. There are not many well established techniques for solving essential tasks to most projects in this domain. For example, we struggled to find well-established techniques for flattening a 3D surface into a 2D surface. And we struggled to find a way to obtain a data set of colored voxels (as illustrated by this post on the Shape Net forum):

And there weren’t many well established techniques for feeding a 3D mesh with color into a neural network.

However, through the results we were able to get from just these preliminary experiments show that these problems are surmountable, and in fact that there is a lot untapped potential in exploring this field.

Also, automatically texturing 3d objects has a wide array of using applications.

Creative industries, like gaming (think Blizzard, Riot Games, and Minecraft) and 3D animation (Pixar, DreamWorks, Industrial Light & Magic) invest huge sums of money texturing 3D models, because that’s pretty important for making the 3D worlds they are making enjoyable:

Legions of 3D artists are usually hired for over a span of a year for one movie or one game, and each artist such large productions usually work on perfecting only a few models. For example, check out this behind the scenes video on modeling for Inside Out or the picture below of Mater from cars [12]. While we think it is unlikely that a network will ever replace a 3D artist, we think it certainly can create new, effective tools for 3D artists to use that make animation and gaming studios more productive.

In the field of marketing, texturing 3D objects is also pretty important, because it allows products to be shown in an appealing way. In fact, there are even 3D marketing companies that specialize in this, like this one.

This project opens up the door to exploring ways to significantly decrease the cost of creating in these industries. Fields like medicine may benefit, by making it easier to colorize 3D data like CT scans.

Finally, tackling a problem like this, may give us an insight into how the human brain understands color, which is a very prevalent lens through which humans see the world. By exploring problems like these, we can better understand how a child learns to identify colors and match them to objects.

Through this process we learnt that there exists a whole host of opportunities, a lot of potential for growth, and not enough work done to solve this problem! This is likely just the start.

[1] Prof. Joseph Lim, USC CSCI 599: Deep Learning and its Applications

[2] Leon A. Gatys, Alexander S. Ecker, Matthias Bethge A Neural Algorithm of Artistic Style

[3] Mehdi Mirza, Simon Osindero Conditional Generative Adversarial Nets

[4] Hvass-Labs. TensorFlow-Tutorials:15_Style_Transfer

[5] Shapenet https://www.shapenet.org/

[6] Yung-Luen Lan Voxelizer

[7] Seitz Voxel Coloring

[8] Takumi Moriya Pix2vox

[9] Takumi Moriya Sketch-Based 3D Modeling via Deep Convolutional Neural Networks

[10] Han Zhang et al. StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks

[11] Tomas Mikolov et al. Efficient Estimation of Word Representations in Vector Space

[12] Making of Pixar Inside Out