Dual-Clock Asynchronous FIFO in SystemVerilog

Apology for the lack of updates, but I have been on a rather long vacation to Asia and am slowly getting back into the rhythm of work and blogging. One of my first tasks after returning to work was to check over the RTL of an asynchronous FIFO in Verilog. What better way to relearn a topic than to write about it! Note that this asynchronous FIFO design is based entirely on Cliff Cumming’s paper Simulation and Synthesis Techniques for Asynchronous FIFO Design. Please treat this as a concise summary of the design described his paper (specifically, FIFO style #1 with gray code counter style #2). This article assumes knowledge of basic synchronous FIFO concepts.

Metastability and synchronization is an extremely complex topic that has been the subject of over 60 years of research. There are many known design methods to safely pass data asynchronously from one clock domain to another, one of which is using an asynchronous FIFO. An asynchronous FIFO refers to a FIFO where data is written from one clock domain, read from a different clock domain, and the two clocks are asynchronous to each other.

Clock domain crossing logic is inherently difficult to design, and even more difficult to verify. An almost correct design may function 99% of the time, but the 1% failure will cost you countless hours of debugging, or worse, a respin. Therefore, it is imperative to design them correctly from the beginning! This article describes one proven method to design an asynchronous FIFO.

Asynchronous FIFO Pointers

In a synchronous FIFO design, one way to determine whether a FIFO is full or empty is to use separate count register to track the number of entries in the FIFO. This requires the ability to both increment and decrement the counter, potentially on the same clock. The same technique cannot be used in an asynchronous FIFO, however, because two different clocks will be needed to control the counter.

Instead, the asynchronous FIFO design uses a different technique (also derived from synchronous FIFO design) of using an additional bit in the FIFO pointers to detect full and empty. In this scheme, full and empty is determined by comparing the read and write pointers. The write pointer always points to the next location to be written; the read pointer always points to the current FIFO entry to be read. On reset, both pointers are reset to zero. The FIFO is empty when the two pointers (including the extra bit) are equal. It is full when the MSB of the pointers are different, but remaining bits are equal. This FIFO pointer convention has the added benefit of low access latency. As soon as data has been written, the FIFO drives read data to the FIFO data output port, hence the receive side does not need to use two clocks (first set a read enable, then read the data) to read out the data.

Synchronizing Pointers Across Clock Domains

Synchronizing a binary count (pointer) across clock domains is going to pose a difficulty, however. All bits of a binary counter can change simultaneously, for example a 4-bit count changing from 7->8 (4’b0111->4’b1000). To pass this value safely across a clock domain crossing requires careful synchronization and handshaking such that all bits are sampled and synchronized on the same edge (otherwise the value will be incorrect). It can be done with some difficulty, but a simpler method that bypasses this problem altogether is to use Gray code to encode the pointers.

Gray codes are named after Frank Gray, who first patented the codes. The code distance between any two adjacent Gray code words is 1, which means only 1 bit changes from one Gray count to the next. Using Gray code to encode pointers eliminates the problem of synchronizing multiple changing bits on a clock edge. The most common Gray code is a reflected code where the bits in any column except the MSB are symmetrical about the sequence mid-point. An example 4-bit Gray code counter is show below. Notice the MSB differs between the first and 2nd half, but otherwise the remaining bits are mirrored about the mid-point. The Gray code never changes by more than 1 bit in a transition.

Decimal CountBinary CountGray Code Count
04'b00004'b0000
14'b00014'b0001
24'b00104'b0011
34'b00114'b0010
44'b01004'b0110
54'b01014'b0111
64'b01104'b0101
74'b01114'b0100
84'b10004'b1100
94'b10014'b1101
104'b10104'b1111
114'b10114'b1110
124'b11004'b1010
134'b11014'b1011
144'b11104'b1001
154'b11114'b1000

Gray Code Counter

The Gray code counter used in this design is “Style #2” as described in Cliff Cumming’s paper. The FIFO counter consists of an n-bit binary counter, of which bits [n-2:0] are used to address the FIFO memory, and an n-bit Gray code register for storing the Gray count value to synchronize to the opposite clock domain. One important aspect about a Gray code counter is they generally must have power-of-2 counts, which means a Gray code pointer FIFO will have power-of-2 number of entries. The binary count value can be used to implement FIFO “almost full” or “almost empty” conditions.

Asynchronous FIFO in SystemVerilog binary and gray pointers

Converting Binary to Gray

To convert a binary number to Gray code, notice that the MSB is always the same. All other bits are the XOR of pairs of binary bits:

logic [3:0] gray;
logic [3:0] binary;

// Convert gray to binary
assign gray[0] = binary[1] ^ binary[0];
assign gray[1] = binary[2] ^ binary[1];
assign gray[2] = binary[3] ^ binary[2];
assign gray[3] = binary[3];

// Convert binary to Gray code the concise way
assign gray = (binary >> 1) ^ binary;

Converting Gray to Binary

To convert a Gray code to a binary number, notice again that the MSB is always the same. Each other binary bit is the XOR of all of the more significant Gray code bits:

logic [3:0] gray;
logic [3:0] binary;

// Convert Gray code to binary
assign binary[0] = gray[3] ^ gray[2] ^ gray[1] ^ gray[0];
assign binary[1] = gray[3] ^ gray[2] ^ gray[1];
assign binary[2] = gray[3] ^ gray[2];
assign binary[3] = gray[3];

// Convert Gray code to binary the concise way
for (int i=0; i<4; i++)
  binary[i] = ^(gray >> i);

Constraining Timing of the Gray Code Signals

A coworker educated me that there is one more piece needed, after coding the gray code counters. Gray code solves metastability because the destination sees only 1 bit changing at any time. However, without specifying a timing constraint, it is possible for the design to be implemented as circuitry on silicon (or FPGA) such that the different bits of the gray code counter have varying path lengths, ultimately causing multiple bits to change simultaneously at the destination flip flops! This would defeat the whole purpose of having the gray code counters.

To guarantee functionality, a timing constraint needs to be applied to all bits of the gray code counter that are crossing the clock domain. For a fast to slow clock crossing, the wire for each of the gray code counter bits (that cross the clock crossing) needs to have a max delay constraint applied, with the maximum delay being one period of the fast clock. This will ensure that the electrical signal for all the bits will transition, at the destination flip flops (clocked by the slow clock), within one fast clock period. That then guarantees at any point in time (especially, at any time when the slow clock ticks), there can only be a one bit transition on the gray code counter signal, as seen by the destination gray code flip flops, because all bits of the gray code counter have been “timing matched” to within one fast clock.

Generating full & empty conditions

The FIFO is empty when the read pointer and the synchronized write pointer, including the extra bit, are equal. In order to efficiently register the rempty output, the synchronized write pointer is actually compared against the rgraynext (the next Gray code to be registered into rptr).

The full flag is trickier to generate. Dissecting the Gray code sequence, you can come up with the following conditions that all need to be true for the FIFO to be full:

  1. MSB of wptr and synchronized rptr are not equal
  2. Second MSB of wptr and synchronized rptr are not equal
  3. Remaining bits of wptr and synchronized rptr are equal

Similarly, in order to efficiently register the wfull output, the synchronized read pointer is compared against the wgnext (the next Gray code that will be registered in the wptr).

Asynchronous FIFO (Style #1) – Putting It Together

Here is the complete asynchronous FIFO put together in a block diagram.

Asynchronous FIFO in SystemVerilog block diagram

The design is partitioned into the following modules.

  • fifo1 – top level wrapper module
  • fifomem – the FIFO memory buffer that is accessed by the write and read clock domains
  • sync_r2w – 2 flip-flop synchronizer to synchronize read pointer to write clock domain
  • sync_w2r – 2 flip-flop synchronizer to synchronize write pointer to read clock domain
  • rptr_empty – synchronous logic in the read clock domain to generate FIFO empty condition
  • wptr_full – synchronous logic in the write clock domain to generate FIFO full condition

Sample source code can be downloaded at the end of this article.

Conclusion

An asynchronous FIFO is a proven design technique to pass multi-bit data across a clock domain crossing. This article describes one known good method to design an asynchronous FIFO by synchronizing Gray code pointers across the clock domain crossing to determine full and empty conditions.

Whew! This has been one of the longer articles. I’m simultaneously surprised that 1) this article took 1300 words, and that 2) it only took 1300 words to explain an asynchronous FIFO. Do you have other asynchronous FIFO design techniques? Please share in the comments below!

References

Sample Source Code

The accompanying source code for this article is the dual-clock asynchronous FIFO design with testbench, which generates the following waveform when run. Download and run the code to see how it works!

Asynchronous FIFO 1 waveform

    Answer

    .

    45 thoughts on “Dual-Clock Asynchronous FIFO in SystemVerilog”

      • Hi allegories, that’s a very interesting comment. I went to educate myself on what a Johnson counter is. It generates a Gray code as well (adjacent states differ by only 1 bit), which satisfies the requirements needed of a clock domain crossing counter. It also has the same disadvantage as gray code counter of only being able to represent power-of-2 states. However, an N-bit counter can only represent 2N numbers, which will require more flip-flops than the gray counter in this example code.

        I’m trying to think of what the advantage would be of using that in an asynchronous FIFO design. To address the FIFO memory you’ll still need to keep a binary counter, so the Johnson counter would only be used to synchronize across the clock domain. You’ll need to devise another way to determine FIFO full/empty condition from the Johnson counter progression (doable). I assume you’re thinking of a Johnson counter because of the simple logic and potentially being able to run at higher frequency? That may very well be true, although in my limited asynchronous design experience, the gray code counter has never been the critical path in my design.

        So in summary I can’t see much benefit of using it right now. What is your experience with using Johnson counters?

        Reply
        • Hi Jason,

          In previous reply you said
          “However, an N-bit counter can only represent 2N numbers, which will require more flip-flops than the gray counter in this example code.”

          I didn’t get your point. Can you please elaborate?

          Reply
          • Hi Parth. That comment was specific to the suggestion (from the poster, allegories) of using a Johnson counter, which is not what is illustrated in this article (this article uses a gray counter).

            Reply
      • Hi Shahid. That is simply an observation from the “Gray Code Count” sequence in the table in the Gray code sequence table above, but it holds true generally. The sequence in that table can represent the item count in an 8-deep FIFO (with an extra MSB bit for tracking full/empty). So for example, when rptr=0 decimal, wptr=8 decimal, the FIFO is full. If you then look at the Gray code for 0 and 8 decimal, you’ll see all 3 conditions hold true.

        Reply
    1. Hi Jason,

      Thanks for the article! I have a question about synchronizing a write and read pointers with double flip flops. This works great for crossing from slow to faster clock, but what about backwards, when crossing from fast to slow clock?

      Regards

      Reply
      • It is true that in that case, the synchronization logic can miss a pointer transition on the slow clock side. But that is okay, because when the slow clock edge happens, the pointer value that is synchronized to the slow clock side is going to be the correct value, and the FIFO level calculations will be correct. It is a case of synchronization where not every value needs to be synchronized, but when synchronization happens, the value will be correct. For a bit more explanation, take a look at my post on Clock Domain Crossing Design – 3 Part Series

        Reply
        • But if one or more values are skipped, then back-to-back values seen by the (slow) receiver can differ by more than one bit, despite the use of gray code. Is this not a problem?

          Reply
          • Hi Max. I realized there is an important detail I missed/assumed in the article. Besides coding the RTL design correctly, there also needs to be a timing constraint placed on the signals of the gray code. The timing constraint has to say that each of the bits of the gray code, must be sent from the fast to slow clock, within 1 cycle of the fast clock. This is a timing constraint on the physical wire, to constrain the flight time of the signal on the wire. With this constraint, it will ensure that the recipient flip flops, on the slow clock, will see the values changing 1 bit at a time (at the rate of the fast clock). Therefore at any point in time when the slow clock ticks, there can only have been 1 bit transition on the gray code signals.

            Reply
    2. Can you publish examples of different clock frequencies (C1 and C2), and also depth and width of memories needed for different clock frequencies without them going full/empty quickly etc.?

      Reply
      • Hi Rajesh. Your question is very design dependent so it’s difficult to give general recommendations. However first of all, the throughput on each side of the clock domain crossing needs to be the same. For example if one side is 2x the frequency as the other, then that side must also somehow pass data at 1/2 the rate so the bandwidth of both sides is the same. That may be achieved by passing data every other clock, or having a twice as wide data bus, etc. Then the depth of the FIFO depends on the latency and behaviour of your logic. If your logic is conditioned on the FIFO full/empty signal then the FIFO shouldn’t overflow/underflow. But depending on the behaviour of the logic the data transfer may “stutter” when starting and stopping which may require you to increase the FIFO depth. Some amount of experimentation//simulation may be needed to figure out the optimal size.

        Reply
    3. Hi,

      Isn’t it more simple to convert the gray pointers after the synchronization into binary pointers and then do the comparison ? i.e. suppose we have after the conversion binary pointers with N bits (MSB bit for wrap, addressing the memory element with MSB-1:0) :
      empty = rd_ptr_bin == wr_ptr_bin
      full = rd_ptr_bin(N) != wr_ptr_bin(N) AND rd_ptr_bin(N-1:0) == wr_ptr_bin(N-1:0)

      Reply
      • Hi Ophir. Yes you could convert the synchronized gray code pointers back to binary combinationally and then do the binary pointer comparison (also combinationally, to avoid an extra cycle). However, that would create more combinational logic, and potentially reduce the maximum frequency of the design.

        Reply
    4. Hi Jason,

      First of all, thank you very much for the post. Secondly I have the same question as Shahid, because what I was thinking is, if dealing with a 8-entry fifo, you can declare a 4-bit rptr and 4-bit wptr, so as you mentioned when rptr is 0 in decimal and wptr is 8 in decimal, it is a full condition. However in this case I think we don’t need one additional MSB on top of that? But from your description, when dealing with a 8-entry async fifo, both of the rptr and wptr will be 5-bit wide. That is what makes me a bit confused.

      Reply
      • Hi Yicheng. You’re correct that an 8-entry fifo should only need a 4-bit rptr and wptr to detect empty and full. The example code implements it this way as well (the code has a 16-entry FIFO with 5-bit pointers). Which part of the article is unclear and suggests an pointer extra bit is necessary?

        Reply
    5. Hi Jason,

      It is a very good article. It cleared most of my doubts. However, I have one. Why there is no rden signal on the reading side, like wren. I think it should be rden = rinc & ~rempty. Please correct me. Thank you.

      Reply
      • Hi Hari. The assumption in this design is the FIFO memory combinationally outputs the read data of whichever read address is selected. Therefore it doesn’t need a rden signal. In a previous design I just built this “memory” out of flops to achieve this behaviour. But if you want to use an actual SRAM memory, then the design will need to be modified to accommodate a rden signal.

        Reply
    6. Hi Jason,

      Thank you for the detailed explanation.
      I have a question regarding empty flag generation. For full flag generation, we check the 3 conditions. Why don’t we have to check the similar conditions for empty flag?

      Reply
      • Hi Nik. One answer is due to the design of the FIFO pointers (using an extra bit to detect full/empty), the empty detection is inherently simpler because you only need to match that the pointers are identical. While the full condition is trickier because the MSB of the pointers are different when the FIFO is full. This is then further complicated by the gray code.

        Reply
      • Hi Rkyv. I haven’t heard of other users having problems. Did you receive a email from the site? You need to click on that email to verify your email address before you will see the download link.

        Reply
    7. Hey Jason,
      Thank you for sharing this article with us.It would be really helpful if you could provide the circuit diagram along with block diagram.

      Reply
      • Hi Surekha. There is a block diagram in the article, is something unclear there? There are pieces of the circuit diagram in the article, but you’re right I don’t have the whole thing drawn out in one figure. Maybe something to add in the future.

        Reply
    8. Hi Jason,
      Thank you for this great article, this is really helpful for me.
      But can I ask one thing? How can I know the FIFO level (how many data is in the FIFO)? Is there a simple way?

      Reply
      • Hi James. I think similar to the idea I commented about “almost full/empty” flags, you can use the binary pointer values to determine the number of entries currently in the FIFO, once you pass one of the pointers (in gray code format) across to the other clock domain. You can then do a binary comparison of the two binary pointers.

        Reply
        • Hi Jason, thank you for this reply.
          I agree that’s the feasible and straight-forward way to measure FIFO level.

          Reply
    9. Hi Jason,

      Very impressive post and detailed explanation!

      I just have one quick question:

      How to design this async FIFO when depth is not 2^N? Especially, the gray code won’t work for this case when pointer wraps around.

      Thanks,
      -Hao

      Reply
      • Hi Hao. That’s a good question. I actually don’t know what’s a good way to handle that, since like you said gray code won’t work for non-power-of-2 sizes. In the past I made sure to design the FIFO size to be power of 2, and I’ve gotten away with it. If you do come across another solution please let me know too!

        Reply
        • Hi Jason,

          I did some search and found the solution to deal with non-power-of-2 sizes is to create a gray code not starting from 0. For example, using example in your post, let’s say we want to design a FIFO with size=14; in your decimal to gray code table, we just don’t use 1st and last row. The rest of 14 rows is now the new gray code. We also need to modify decimal counter to have it count from 1 to 14 instead of 0 to 13.

          However, the limitation still occurs: it can only handle FIFO depth with even size(since the original gray code is mirrored). For odd FIFO depth, I haven’t got a solution.

          Thanks,
          -Hao

          Reply
    10. Hi Jason,

      Thanks a lot for this article.

      I also have a question regarding double sync of the pointers. I agree that, as you commented under Snjna’s question, in this async FIFO case, destination domain doesn’t need to sync every value from source domain.

      Suppose WClk is faster than RClk, then Wptr could increase several times before being captured by RClk. Even if it’s captured, it takes at least 2 RClk due to the double sync mechanism. Is it true that this could lead to some “false” FIFO empty judgement in RClk domain even though the FIFO is not empty, since RClk domain is still using the previous synced Wptr value, while the actual Wptr in WClk domain already increased and corresponding FIFO slots have been filled?

      Similarly a “false” full judgement could be made the other way around. I think functional-wise this is not causing any issue, just adding some sort of varying spacing between Wprt and Rptr depending on W/Rclk frequencies?

      Thanks a lot!

      Jimmy

      Reply
      • Hi Jimmy. You’re correct that even after a write, because of the CDC, empty will be slow to deassert and indicate !empty. Similarly full will be slow to deassert and indicate !full. Both are safe and won’t cause FIFO to overflow and malfunction (as long as write side logic observes full, and read side logic observes empty). But yes that extra variable delay is the penalty of crossing a CDC.

        Reply
    11. Hello Jason,
      I have written verilog code for asynchronous FIFO where I have described the hardware you have mentioned. I am getting the output but there are some problems when I am comparing read pointer and write pointer. It generates full or empty condition when FIFO is not full or empty. The problem occurs because in read domain when I compare read and write pointers , the read pointer is of that domain while the write pointer is of another domain which is being synchronized in read domain so it gets delay of two clock cycles. Same problem is occuring in write domain while comparing both the pointers. It is generating fifo full or fifo empty condition undesirably. Can you please help me out.
      Thanks.

      Reply
      • Hi Payal. You’re correct that due to synchronization of the pointers across clock domain, the full and empty signals do not update “instantaneously”. Their behaviour is pessimistic, meaning a full condition will immediately be reflected, but it takes some cycles after FIFO is no longer full for the full signal to deassert. Similarly once the FIFO is empty, empty immediately asserts, but it takes some cycles after data is written for empty signal to deassert. This is actually safe behaviour because as long as your write logic always looks at full, and read logic always looks at empty, the logic will be guaranteed to not overflow/underflow the FIFO.

        Reply
    12. // Convert gray to binary ==> has to be “binary to gray”
      assign gray[0] = binary[1] ^ binary[0];
      assign gray[1] = binary[2] ^ binary[1];
      assign gray[2] = binary[3] ^ binary[2];
      assign gray[3] = binary[3];

      Reply
    13. Hello Jason,
      In this article provided, the rd ptr and wr ptr are multibit. is that multi bit ptr’s are sending serially through cross domain or it is parallel transmission, so the synchronizer used is each bit need to have two flop synchronizer

      Reply
    14. How to run your code?
      I tried icarus verilog, but failed.
      What’s your simulation envirement to generate the waveform picture?
      Thanks!

      Reply
      • Yes that will work too. But for a real life design you should separate the test bench from the design so the design files can be taken through the complete design flow including synthesis and physical design.

        Reply

    Leave a Comment

    This site uses Akismet to reduce spam. Learn how your comment data is processed.