Building a Deep Learning Machine

There is no shortage of literature online containing opinions about hardware for deep learning machines, nor is there a shortage regarding provisioning the machines with common deep learning tools. However, there are not many resources that critically consider why the hardware was chosen, nor many that walk you through both building and provisioning in one place. This article attempts to do both.

A lot has been written comparing and contrasting the various GPUs available, as well as their costs. There isn’t much point in diving into a granular relative cost argument here other than to reinforce that there are two drivers of commodity GPU cost: number of cores and size of memory. For example, a GTX 1060 6GB has 1280 cores running at approximately 1.5GHz and, obviously, 6GB of memory. A GTX 1080 Ti has 3584 cores running at approximately 1.6GHz and 11GB of memory. If you are not fully utilizing your GPU, you will not see much of an improvement between the GTX 1060 6GB and the GTX 1080 Ti. The reason is that the architecture of both cards is the same (Pascal); the main differences are the number of cores and the speed of those cores. The benefit of the GTX 1080 Ti is more about capacity and memory bandwidth than it is about compute speed. If you consider commodity GPUs like the GTX 1060 6GB and the GTX 1080 Ti, you will notice that the price scales with capacity. Barring unexpected market forces, like the temporary increase in prices due to cryptocurrency mining demand, prices actually tend to scale sublinearly with capacity. Your best strategy for value is to get a commodity GPU with the highest capacity you think you will need and can afford.
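If you want to put rough numbers on that value argument, a few lines of Python will do it. The prices below are hypothetical placeholders rather than quotes, so swap in current market prices before drawing any conclusions.

```python
# Rough value comparison of commodity GPUs. The prices are illustrative
# placeholders -- check current market prices before deciding.
cards = {
    # name: (cores, clock_ghz, memory_gb, price_usd -- hypothetical)
    "GTX 1060 6GB": (1280, 1.5, 6, 250),
    "GTX 1080 Ti": (3584, 1.6, 11, 700),
}

for name, (cores, clock, mem, price) in cards.items():
    print(f"{name}: ${price / mem:.0f} per GB of memory, "
          f"${price / (cores * clock):.2f} per core-GHz")
```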

The specialty GPUs typically offer higher capacity and more condensed configurations, and come at an increased cost. GPUs like the K80 and V100 give you a denser array of GPU hardware, and their architecture tends to be more advanced at introduction than that of the commodity GPUs. If you really need the speed and the added capacity, these may be good choices, but they are quite a bit more expensive. Somewhere in the middle is the Titan series. The Titan series has typically introduced more cores earlier than the commodity cards and tends to have more memory. The Titan V is the one to pay attention to at the moment: it has more cores than the GTX 1080 series, more memory, and includes the new tensor cores (basically a slimmed-down PCIe edition of a V100). Your best strategy for performance is to get the newest, highest capacity GPU you can afford.

The most important big-picture consideration is your understanding of your problem, how it scales, and what capacity you need to accomplish your training in the timeframe you require. The more cores and memory your GPU has, the larger the models and the larger the batch sizes you can use. Take some time to think about how your problem scales and what future problems you may want to work on; these considerations will often guide you to the best GPU strategy.
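As a rough way to think about capacity, here is a back-of-envelope memory estimate. The model and batch figures are purely illustrative, and real frameworks add overhead for gradients, optimizer state, and workspace, so treat this as a sketch rather than a sizing tool.

```python
# Back-of-envelope GPU memory estimate (illustrative assumptions, not exact).
def estimate_memory_gb(num_params, activations_per_sample, batch_size,
                       bytes_per_value=4, overhead_factor=3.0):
    """Rough memory needed: weights (with gradients/optimizer state folded into
    overhead_factor) plus activations for one batch, in GB."""
    weights = num_params * bytes_per_value * overhead_factor
    activations = activations_per_sample * batch_size * bytes_per_value
    return (weights + activations) / 1e9

# Hypothetical example: a 25M-parameter convnet with ~5M activation values
# per sample, trained with a batch size of 256.
print(f"{estimate_memory_gb(25e6, 5e6, 256):.1f} GB")
```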

Critical to training and evaluating your models are the memory bandwidth of the GPU and the bandwidth between your GPU and CPU. The memory bandwidth of the GPU determines how fast on-GPU memory can be read and written. In this respect, amongst the commodity GPUs, products like the GTX 1080 Ti are superior, with an aggregate memory bandwidth of over 480 GB/s. In case you don’t have a sense for that figure, it is incredibly fast.
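If you want to see that bandwidth for yourself, a minimal check like the following works, assuming you have PyTorch with CUDA available. A large device-to-device copy touches each byte once on read and once on write, so the effective rate it reports should land in the same ballpark as the card's specification.

```python
# Minimal on-GPU memory bandwidth check (assumes PyTorch with CUDA available).
import time
import torch

x = torch.empty(256 * 1024 * 1024, device="cuda")  # 1 GB of float32
y = torch.empty_like(x)

torch.cuda.synchronize()
start = time.time()
iters = 20
for _ in range(iters):
    y.copy_(x)            # device-to-device copy: 1 GB read + 1 GB write
torch.cuda.synchronize()
elapsed = time.time() - start

print(f"~{2 * iters * x.numel() * 4 / elapsed / 1e9:.0f} GB/s effective bandwidth")
```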

Training a model has fixed computational costs and marginal computational costs. The fixed costs are things like initializing the model weights on the GPU, whereas the marginal costs are things like transferring batch data to the GPU. The marginal cost is the most important part to focus on since it affects training time the most. The bandwidth of the bus between the CPU and the GPU governs a controllable aspect of the marginal cost and ideally should be as fast as possible.
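Here is a minimal sketch of measuring that marginal cost, again assuming PyTorch with CUDA. It times repeated host-to-device transfers of a single batch from pinned memory, which is roughly the traffic your input pipeline generates every training step.

```python
# Timing the marginal cost of moving one batch to the GPU (assumes PyTorch + CUDA).
import time
import torch

batch = torch.randn(1024, 3, 96, 96).pin_memory()  # pinned host memory speeds up transfers

torch.cuda.synchronize()
start = time.time()
iters = 50
for _ in range(iters):
    batch_gpu = batch.to("cuda", non_blocking=True)
torch.cuda.synchronize()
elapsed = time.time() - start

bytes_per_batch = batch.numel() * batch.element_size()
print(f"{bytes_per_batch / 1e6:.0f} MB per batch, "
      f"~{iters * bytes_per_batch / elapsed / 1e9:.1f} GB/s host-to-device")
```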

The GPU is connected to the CPU by the PCIe bus. The PCIe bus is organized as a collection of parallel “lanes”, each of which is a bidirectional serial interface. Each lane under PCIe 3.0 runs at slightly under 1 GB/s in each direction. The GPUs mentioned in this article all support x16, meaning 16 multiplexed PCIe lanes (with an aggregate throughput of approximately 15 GB/s). The motherboard is often in control of multiplexing the bus, so a motherboard that supports as many x16 PCIe slots as possible is required to attain full bandwidth. This is done by what is often referred to as the northbridge controller, an off-CPU glue component that is functionally closest to the CPU. Keep in mind that just because the motherboard can multiplex the 16 lanes doesn’t mean your CPU supports them (more on that later). The point is that you need to identify the ability of your motherboard to support the number of lanes for each GPU you intend to install.

Your GPUs are not the only devices that use PCIe lanes. Some high-speed devices, like M.2 solid state drives, are directly multiplexed by the northbridge controller. Many of the lower-speed controllers and connectivity options are also derived from PCIe through what is often referred to as the southbridge controller, so called because it is functionally the furthest from the CPU. All of this is to make you aware that you have to take a holistic view of the PCIe lanes and what they are used for, not just the GPUs. The best M.2 drives are x4 PCIe devices, meaning that they have a bandwidth of just under 4 GB/s. There are also M.2 RAID controllers that allow for multiple x4 M.2 drives and can produce better performance for a premium price.
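A quick way to sanity-check a drive against its rated PCIe bandwidth is to time a large sequential read. The path below is a placeholder; point it at a file larger than your RAM (or drop the page cache first) so the operating system's caching doesn't inflate the number.

```python
# Quick sequential read throughput check for a storage device.
# 'path' is a placeholder -- point it at any large existing file on the drive.
# Note: the OS page cache can inflate the result; use a file larger than RAM.
import time

def read_throughput_gbps(path, block_size=8 * 1024 * 1024):
    total = 0
    start = time.time()
    with open(path, "rb", buffering=0) as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            total += len(block)
    return total / (time.time() - start) / 1e9

# print(read_throughput_gbps("/data/large_training_file.bin"))
```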

There are a few considerations to be attentive to when choosing a CPU for a deep learning setup. This may come as a surprise, since the general thought is that the GPU is doing all the work. The most critical CPU considerations are how many PCIe lanes are brought out of the die, how many cores are on the die, and how much L3 cache is on the die.

The reason it matters how many PCIe lanes are brought out of the die relates to the discussion in the previous section. You absolutely must have enough PCIe lanes to establish a wide multiplexed bus to the critical hardware, meaning both your GPUs and your drives. You can easily look up how many PCIe lanes are exposed for a given family of CPU. What you should do is research and understand both your processor and your motherboard to make sure that they fully support the bandwidth budget you are attempting to meet. For example, if you have two GTX 1080 Ti GPUs that you intend to run at x16 and one M.2 SSD that supports x4, then you will require 36 PCIe lanes just for these peripherals. The next step up in PCIe lane count from a typical processor is 44 lanes, so with such a processor you would be able to budget this arrangement as well as the built-in lower-speed southbridge hardware.
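The budgeting itself is simple enough to jot down; here is the example above expressed as a few lines of Python, with the 44-lane figure standing in for an i9-class processor.

```python
# PCIe lane budget for the example above: two x16 GPUs plus one x4 M.2 SSD.
devices = {
    "GTX 1080 Ti #1": 16,
    "GTX 1080 Ti #2": 16,
    "M.2 NVMe SSD": 4,
}
cpu_lanes = 44  # e.g. an i9-class processor

needed = sum(devices.values())
print(f"Lanes needed: {needed} of {cpu_lanes} "
      f"({cpu_lanes - needed} left over for lower-speed southbridge devices)")
```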

The number of cores available and the amount of cache have a profound effect on some types of training. If you are doing data augmentation, you will want to use a threading model to augment the data as fast as possible, and the key metric of interest is the throughput of the augmentation. Since you have just set aside x16 PCIe lanes for the GPUs, there is some serious work to do to keep up with the 15 GB/s bandwidth of the bus. Consequently, having as many CPU cores, and as much on-die cache, as you can afford is the only way to get sufficient traction.
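A concrete way to find out whether your CPU keeps up is to time the augmentation pipeline by itself with different numbers of worker processes. The sketch below uses PyTorch and torchvision with an illustrative set of transforms on STL-10 (which also sets up the example that follows); your own pipeline will obviously produce different numbers.

```python
# Measuring augmentation throughput with different numbers of CPU worker
# processes (assumes PyTorch and torchvision; the transforms are illustrative).
import time
import torch
from torchvision import datasets, transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(96, padding=8),
    transforms.ColorJitter(0.4, 0.4, 0.4),
    transforms.ToTensor(),
])

if __name__ == "__main__":
    train_set = datasets.STL10("./data", split="train", download=True,
                               transform=augment)
    for workers in (1, 4, 8):
        loader = torch.utils.data.DataLoader(train_set, batch_size=1024,
                                             num_workers=workers, pin_memory=True)
        start = time.time()
        images = 0
        for batch, _ in loader:
            images += batch.size(0)
        print(f"{workers} workers: {images / (time.time() - start):.0f} images/sec")
```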

As an example, suppose you are training using augmentation of the STL-10 data set. An image is three channels of 96x96 pixels. Suppose your GPU can sustain training on 10 batches of 1024 images each second. This means that the PCIe bus for the GPU needs to handle a bandwidth of just shy of 300 MB/s, which is not a problem with our x16 PCIe configuration. The CPU needs to be capable of performing augmentation operations at the same rate, which leaves you something on the order of 10 CPU cycles per pixel per core (assuming cores running at roughly 3GHz). Depending on the specifics of the augmentation, you will require more or fewer cores to accommodate that rate. The key consideration is knowing how many cores you actually need, which depends entirely on your particular problem.
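That arithmetic is easy to reproduce; the only assumption beyond the paragraph above is the roughly 3GHz core clock.

```python
# The back-of-envelope arithmetic from the STL-10 example above.
pixels_per_image = 3 * 96 * 96          # three channels of 96x96
images_per_sec = 10 * 1024              # 10 batches of 1024 images per second
bytes_per_pixel = 1                     # uint8 source data

bus_mb_per_sec = pixels_per_image * images_per_sec * bytes_per_pixel / 1e6
print(f"PCIe traffic: ~{bus_mb_per_sec:.0f} MB/s")   # just shy of 300 MB/s

cpu_clock_hz = 3e9                      # assumed ~3GHz core clock
pixels_per_sec = pixels_per_image * images_per_sec
print(f"~{cpu_clock_hz / pixels_per_sec:.0f} CPU cycles per pixel available per core")
```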

The assumptions made here merely suggest that, for this problem, the CPU is the bottleneck, at least for an augmentation example where the original full training set can be loaded in memory. In reality, depending on the specifics of the problem, different things can be the bottleneck. An extremely computationally complex model that does not use augmentation but has an extremely large training set may be limited by either the GPU or the disk. One of the reasons people worry about things like PCIe lanes for the GPU is that this is a simple aspect of the problem to eliminate as a bottleneck.

There are some general rules of thumb circulating for building deep learning machines. One is that you should keep the number of CPU cores at least 1.5 times the number of GPUs. Personally, I don’t think this is adequate. It may be from the perspective of CPU load, but given the simple example above it is clear that the CPU is often a bottleneck, particularly for problems involving augmentation. Another rule of thumb is that you should have at least as much CPU memory as total GPU memory. The reason is that you will be creating batches in RAM and transferring them to your GPU, so to maximize throughput without swapping you need CPU memory matching the total GPU memory plus the OS plus all running programs. This is pretty obvious and a good idea.

The above considerations were the general framework for selecting components for the machine I recently built. I knew that I was primarily interested in heavily augmented training of convolutional networks, so I wanted as much GPU capacity as my budget allowed. Consequently, I selected two GTX 1080 Ti GPUs. I also selected a third GPU just for video so that the two main GPUs didn’t incur any utilization for display. I didn’t want the PCIe bus to be a bottleneck, so I attempted to maximize its performance by supporting an x16 connection to each GPU, which governed the motherboard and CPU selection. In order to support the x16 decision, I needed a motherboard that supported multiplexing two PCIe slots at x16 each, as well as a CPU supporting 44 PCIe lanes.

There were a number of other important considerations. Each GTX 1080 Ti consumes 250 watts at full load, and the system as a whole comes in at slightly over 900 watts with everything running at full capacity. To give a reasonable margin, I opted for a 1200 watt power supply. That power density, even in a fairly generous case, would get quite hot, so I opted for a fully liquid-cooled CPU and GPUs. With this setup I never see the GPUs running at a reported temperature above about 45 degrees Celsius, whereas on my older fan-cooled deep learning machine the GTX 1060 6GB routinely runs at over 70 degrees.
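The power budget behind the 1200 watt choice looks roughly like this. The GPU figure comes from the specification; the CPU and miscellaneous numbers are my assumptions, so check your own parts' ratings.

```python
# Rough power budget behind the PSU choice (the non-GPU figures are assumptions).
load_watts = {
    "GTX 1080 Ti x2": 2 * 250,
    "i9-7900X CPU": 230,                         # assumed heavy-load draw, well above its 140 W TDP
    "Display GPU, drives, fans, pump, board": 180,  # assumed
}
total = sum(load_watts.values())
psu = 1200
print(f"Full load: ~{total} W; PSU headroom: {100 * (psu - total) / psu:.0f}%")
```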

The specific hardware selected, as discussed above: an Intel i9-7900X CPU, two GTX 1080 Ti GPUs for computation, a GT 710 for display, an M.2 SSD, a 1200 watt power supply, and liquid cooling for the CPU and GPUs.

One elephant in the room is why the i9-7900X rather than something like an i7-7800X. The answer is the augmentation bottleneck. An i7-7800X has 6 cores and about 8MB of L3 cache, whereas the i9-7900X has 10 cores and about 14MB of L3 cache, giving the i9-7900X a notable advantage for augmentation. It is convenient that the choice of the i9-7900X also brings more PCIe lanes to support the x16 GPU buses, thereby completely eliminating that potential bottleneck. Moreover, the i9 series offers significant upgrade potential, with up to 18 cores and over 24MB of L3 cache.

The standard reason for not upgrading to the i9 series CPU is cost, but I counter this with the observation that for many problems the performance bottleneck of the CPU can severely limit the aggregate system performance. It is a far greater waste of money to put another GPU in a machine that can’t adequately generate training data than it is to upgrade the CPU and achieve the desired performance. This argument is clearly predicated on both an interest in the higher performance and the specifics of the model and training data.

Would I build this machine again, knowing what I now do from building and using it, versus the potential alternatives? Yes, but as you might expect it depends on the problem at hand. If I were working with smaller data sets and smaller models I would be just as satisfied with my old GTX 1060 6GB machine, which is a junker commodity Dell Inspiron i5 with a better power supply and a GPU added. It doesn’t have anywhere near the capacity, but it is perfectly capable for smaller problems and is much less expensive. Have I thought about alternatives like a commodity i7 with something like a single K80? Yes, but at this point you get more capacity with the dual GTX 1080 Ti, and both options have the CPU bottleneck for the type of problems I generally work on. To eliminate that bottleneck you may as well go with the dual GTX 1080 Ti and the i9 and have a more capable GPU setup. What about a Titan V? That is a pretty high price to swallow for a video card. Frankly, I would like to build one and see. The Volta architecture with the tensor cores (cores specifically designed around 4x4 matrix FMA at FP16 precision) is really interesting. The rest of the Titan V appears to be a slight upgrade over the GTX 1080 Ti, but not remotely enough to justify the cost; it is all about taking advantage of the tensor cores. One thing is for certain: you could build a smaller machine with fewer and less expensive supporting components than a dual GTX 1080 Ti setup requires. When I build a machine with a Titan V I will be sure to write an article about how it went.

And where does this leave renting the hardware? Just do the cost math. Look at how much you expect to be training, what hardware it requires, and see what your budget would be for performing the training on rented hardware. It doesn’t take much training to find that building or buying is a better proposition than renting over a reasonable timeframe, not to mention rental catastrophes such as unintentionally leaving a large instance running for a day or two. If you don’t do much deep learning, need transient large capacity, or are growing a team and don’t know exactly what you need, perhaps renting is a reasonable idea.
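The cost math is only a couple of lines. Every number below is a hypothetical placeholder, so plug in current cloud rates, your expected training hours, and the actual cost of your parts list.

```python
# Rent-vs-build break-even sketch (all prices are hypothetical placeholders).
build_cost = 4500.0        # assumed total for a dual GTX 1080 Ti machine
cloud_rate_per_hour = 3.0  # assumed rate for a comparable multi-GPU instance
training_hours_per_week = 30

weeks_to_break_even = build_cost / (cloud_rate_per_hour * training_hours_per_week)
print(f"Break-even after ~{weeks_to_break_even:.0f} weeks of renting")
```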

The problem with X (the X window system) is that the two GPUs intended for computation are in the PCIe slots closest to the CPU. Being in the closest slot generally means being the default video card, and the default video card generally gets used for video rather than computation. The plot thickens: while you can configure X to use pretty much whatever video card arrangement you please, your BIOS most likely does not give you the same option. If you run a non-default video card for your primary display, you will lose sight of the BIOS boot screen and will have to physically connect your monitor to the default card to see it in case of catastrophic problems.

Since I have the GT 710 for video and it is in the PCIe slot furthest from the CPU, some configuration changes had to be made. What you need to do is create an xorg.conf file that indicates the correct video card device. The easiest way to start is to run nvidia-xconfig, which will create the configuration file for you and tell you where it is located. Then look under the “Device” section and add a line specifying the PCIe bus ID of the video card you want to use. The bus ID can be found a few ways. First, you can use lspci | grep VGA and look for the card. This reports the bus ID in hex, so be sure to convert it to decimal for the purposes of xorg.conf; you will be sorry otherwise (meaning X will fail to start and you will need to Ctrl-Alt-F2 to get to a terminal to fix it). Second, you can use nvidia-xconfig --query-gpu-info to get the information in a format that is exactly what xorg.conf is looking for. For me the line I had to add was BusID "PCI:101:0:0". Just reboot the machine and it should work as expected. You should have video from the correct video card, and the next time you run nvidia-smi you should not see any X utilization on your compute GPUs.
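If you take the lspci route, the hex-to-decimal conversion is the easy part to get wrong. Here is a small helper (my own, not part of any NVIDIA tooling) that turns an lspci address into the BusID string xorg.conf expects:

```python
# Convert an lspci address (hex, e.g. "65:00.0") into the decimal
# "PCI:bus:device:function" string that xorg.conf expects.
def lspci_to_xorg_busid(address):
    bus, rest = address.split(":")
    device, function = rest.split(".")
    return f"PCI:{int(bus, 16)}:{int(device, 16)}:{int(function, 16)}"

print(lspci_to_xorg_busid("65:00.0"))   # -> PCI:101:0:0
```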

At this point I hope some insight has been shared regarding hardware choices for building a deep learning machine, and that if you are interested in building your own you have a bit more clarity than you did before. More than just the selection of hardware, my hope is that walking through my own considerations, build process, and provisioning will help get you up and running quickly as well.
