Adding an optimized implementation of Roman's 1-wire concept

#1 Open

Bitbucket cannot automatically merge this request.

The commits that make up this pull request have been removed.

Bitbucket cannot automatically merge this request due to conflicts.

Review the conflicts on the Overview tab. You can then either decline the request or merge it manually on your local system using the following commands:

hg update default
hg pull -r 132518419577
hg merge 132518419577
hg commit -m 'Merged in serisman/new-liquidcrystal/default (pull request #1)'
  1. Stephen Erisman

Great work there and a brilliant contribution to the library.

I have to see if this is pull in to the main branch this addition to the library.

Comments (44)

  1. Stephen Erisman author

    Here are the Benchmark II results (16MHz Arduino w/ ATmega8)

    Interface: SR1W

    ByteXfer: 575uS

    16x2FPS: 51.11

    Ftime: 19.56ms

    This puts it between the original 4-bit implementation and the slowest i2c implementation.

  2. Stephen Erisman author

    Additional benchmark data (16 MHz Arduino w/ ATmega328P)

    Interface: SR1W

    benchmark1: 20106 us

    benchmark2: 228820 us

    benchmark3: 24461 us

    benchmark4: 24440 us

    avg. write: 587.89

  3. Stephen Erisman author

    I found a way to make a major speed improvement (more can be done as well). The pull request has been updated with the new code.

    Here are the new benchmark results:

    • Interface: SR1W

    • benchmark0: 6823 us

    • benchmark1: 78821 us

    • benchmark2: 8035 us

    • benchmark3: 8016 us

    • avg. write: 197.03

    For Benchmark II:

    • Interface: SR1W

    • ByteXfer: 184uS

    • 16x2FPS: 159,66

    • Ftime: 6.26ms

  4. Bill Perry

    Very cool. I'll have to wire it up and try it. Without the diode I was only ever able to get it down to around 300us ByteXfer. One of the things I never fully solved was getting it to work on ChipKit which only drives outputs at 3.3V which alterns the charge/discharge time. I got something to work, but it ended up being quite a bit slower on a processor that was much faster than the AVR. I'll have to look at this circuit and see how the 3v affects the timing. Overall, pretty funny that a single wire interface can be faster than the original Arduino Library that used 6 pins. It really demonstrates how crappy the DigitalWrite() implementation is.

    --- bill

  5. Stephen Erisman author

    What ChipKit version have you been trying to get this to work with? I have never used one, but I am intrigued.

    The way I understand it voltage shouldn't matter much. Sure, the capacitor would reach 3.3v quicker than 5v if it was being charged with the same current, but you have a fixed resistor value which means less current for a lower starting voltage. As far I understand it, it is the ratio of resistance to capacitance that determines the charge/discharge time. I could be wrong on this though.

    That being said, (with this design) there are two very important timing constraints that could be coming into play.

    First, and probably most important, is the time it takes to toggle the line to clock in a '1' during the shift portion, or to clock in a '0' during the clear portion. In the code, you will see I tried to make this as short of a delay as possible and made sure nothing could interrupt the processor while performing this toggle. Ideally, it would be 1 clock cycle, although that is not easy to accomplish without getting into assembly code (even then, maybe that is too aggressive). If this delay is too long, you will either get the wrong value shifted in or you might accidentally trigger the Latch/EN, or the capacitor might not reset fast enough for your next shift cycle. I ran across this problem several times while writing this code. I finally figured out what was going on and tried to optimize it as much as possible. It is still a few more clock cycles than I would like (about 6 currently), and I doubt it would work at all if it has to fallback on the slow digitalWrite calls.

    The second time constraint is also very important, but much easier to control. It takes the capacitor a certain amount of time to drain and then re-charge (or charge and then drain). This is the RC time constant and you should be good as long as you delay for at least this long (plus a safety margin). It doesn't hurt anything (other than your speed) to delay for much longer than needed. I tried to pick a reasonable delay based on the resistor/capacitor combination and the desire for this to be as fast as possible. There probably isn't much that can go wrong with this timing constraint (other than it just being wrong) because the code just forces the processor to waste time for a certain amount of time. No matter what the clock speed is, it should waste roughly the same amount of time.

    So, my guess is any problems are probably mostly related to the first timing constraint. It could cause the following issues:

    1. For a slower processor, each clock cycle takes more time which will drain the capacitor further. At some point it will have drained too far and bad things can happen.

    2. For the ChipKit, the code probably compiles differently than on the AVR/Arduino platform. Even though the clock is faster, it may end up taking more clock cycles for the processor to toggle the lines which could once again cause the capacitor to drain to a point where bad things start happening. If this is the case, the only option is to increase the RC constant and lengthen your 2nd time constraint.

    This code has only been tested (so far) with a 16 MHz clock on an AVR/Arduino. It assumes a 1.5k resistor with 2.2nF cap for an approx 3.3uS RC constant. I actually used an 1.8k resistor because I didn't have an 1.5k resistors on hand, which changes the RC constant to 3.96uS. For the Arduino @16MHz this means we have drained the capacitor approx 1/10 by the time the clock is triggered (6 cycles / (16 * 3.96). So a 5v line looks like 4.5v to the shift register (still easily within it's range for a high signal). For a low value, we are delaying a full 5uS to allow the capacitor to more than fully drain, and then delaying another 5uS to allow it to fully charge back up before moving on.

    A slower clock on the AVR/Arduino may cause problems (I can test this later). A faster clock should be fine. If a faster clock is used, technically you can can probably lower the RC time constant which would reduce the delay requirement a bit and provide an even faster speed. You just really need to be able to handle the first timing constraint (and pick your delay to match).

    Let's take a look at a hypothetical example of a 1nF cap and 1k resistor which has an approx 1uS RC constant... For an (overclocked) AVR @24MHz, 1uS is 24 clock cycles. So the capacitor would completely charge/discharge in 24 clock cycles. The current code (on the AVR) takes about 6ish clock cycles to toggle the line, so the capacitor would already be 1/4 (or so) discharged. This may or may not cause a problem, but @16MHz the capacitor is already 1/3 discharged, and @8MHz the capacitor is already 3/4 discharged which would almost certainly cause a problem. You can see how a faster clock should mean you can get by with a shorter delay (if you lower your RC constant), but for a slower clock would need a larger RC constant (and therefore longer delay).

    I also can't believe some of the bloat that exists in the Arduino Libraries. I also wrote the same 1-wire implementation for a pure AVR C and assembly library (still private for now). The code ended up being much easier to troubleshoot, 1/4 the size, and quite a bit faster. If you have extra flash space and cpu cycles though, the Arduino platform can definitely be much easier/faster to deal with.

  6. Bill Perry

    On the chipkit issue. I'm using an UNO32 and I was using my code written from scratch about a year and half ago vs yours. I'm the one that added the chipkit support to fm's library so I'm very familar with its timing and io registers vs the AVR. I got my 1 wire stuff working on both platforms. Output voltage in this case very much matters becuase the 595 is "seeing" analog voltages vs digital voltages. The uno32 board is 5v tolerate but drives outputs at 3v. The lcd is a 5v device expecting normal 5v levels. The 3v outputs from the chipkit normally work ok, but since we are using an analog voltage from the RC network, the RC timing changes. You have to wait longer because it takes longer to pull the RC network up high enough to register as a 1 for a 5v input when using a 3v signal vs a 5v signal. I also ran into an issue on the quick pulses. Because the 3v is barely enought to drive a 1 on a 5v input, it didn't take much to drop the RC voltage down below the threshold. So even several quick pulses might be enough to drop the voltage down enough on the RC network. So there had to be recovery delay even on the short pulses.

    In hooking up the circuit to try your code, it got very confusing. Actually I wasn't confused at first since I simply wen't by the ASCII schematic, but then when things weren't working, I noticed some things that seem "off" and confusing. There are some issues between the comments, the ascii schematic, the output bit constants, and the shiftout bit order. On a 595 normally bit 0 is QA and bit 7 is QH. Some manufactures call the pins Q0 to Q7. If you look at pinout that uses Q0-Q7 naming Q0 will be the same pin as the QA pin so I think using bit 0 for QA makes more sense especially since that will allow bit 0 to be Q0 on the other manufacturers datasheets. The other SR code modules use this convention and hence shift the data out MSB first. It would be nice if this module used the same convention to avoid any confusion.

    Ignoring the bit numbering, the comments say EN/latch is QH and the schematic shows using Q'H (which is it?) What I think would be nice would be if this code used the same bit numbers and pins for the outputs as the SR2W to keep things consistent, but as a minimum I think it would be helpful to umber the bits so that bit 0 is QA or Q0. This would seem to be less confusing and would also be consistent with the other modules.

    Here is the loadSR() function that I was using in my code: It didnt' use the fio stuff but did shift out msb first.

    void LiquidCrystal_SR1W::loadSR(uint8_t value) 
         * Send out the upper 7 bits of the byte
        for(uint8_t bit = 7; bit > 0; bit--)
            if(value & 0x80)
                *_srdclk_addr &= ~_srdclk_mask;
                *_srdclk_addr |= _srdclk_mask;  // clock in a 1
                *_srdclk_addr &= ~_srdclk_mask;
                *_srdclk_addr |= _srdclk_mask; // clock in a 0
            value <<= 1;
         * send a long pulse to send dummy bit 0 and latch the data
        *_srdclk_addr &= ~_srdclk_mask;
        *_srdclk_addr |= _srdclk_mask;

    Still working on getting your code up and running on my 16mhz m328 based hardware.

  7. Stephen Erisman author

    Thanks for pointing out that 5v vs. 3.3v DOES make a difference when we consider the impact of the output voltage levels compared to the expected input levels of a 5v component instead of just looking at the rate of delay. According to ( a standard 74HC595 (CMOS) driven at 5v needs at least 3.7v to register a HIGH, and at most 1.3v to register a LOW. A 3.3v device can't handle the 3.7v requirement even without any delay at all. Using a 74HCT595 (CMOS w/ TTL Inputs) or 74LS595 (TTL/bi-polar) this lowers to at least 2v for a HIGH and at most 0.8v for a LOW. This is within the range of a 3.3v device, but it would again drop out of range once the capacitor loses about 40%, whereas a 5v device will still be at around 3 volts after loosing 40%. It wouldn't drop out of range (under 2v) until a 80% lose (or almost twice as long). So a 3.3v device, when driving a 5v device WILL need a shorter (clock) delay to stay in spec. You could probably keep the same delays by driving your shift register at 3.3v (instead of 5V) and potentially using a 3.3v capable LCD. Out of curiousity which 74xx595 are you using for your 3.3v device, and are you driving it's VCC at 5v or 3.3v?

    I'm sorry that you found my circuit/comments confusing. After reading through your comment, I realize that (for whatever reason) I am more used to a LSB first convention when using shift registers, but the existing SR2W code is using the MSB first convention. I can change this around to be more consistent. I will send an updated pull request later tonight when I have a chance (I'm at work right now).

    I'm not sure what your comment about QH vs. Q'H is all about. I re-read through my comments (in the .h file) and found everything properly references Q'H (not QH) which is correct. We want the non-latched output to be the trigger to the Latch/EN pins. This pin is also referred to as the serial out pin, as it is commonly used to drive the next shift register in a chain.

    Also, I'm not sure what to make the code snippet that you posted as it doesn't tell the full picture. I'm not sure what schematic you are using at this point, and I don't know what your delay values are, and the code doesn't seem to take the same approach mine does (with regards to clearing the shift register, etc..). Can you post more information?

  8. Stephen Erisman author

    Here is an example of how careful we have to be with the first timing constraint I talked about earlier...

    Originally I had code like this:

      fio_digitalWrite_HIGH(_srRegister, _srMask);
      fio_digitalWrite_LOW(_srRegister, _srMask);

    This generates the following assembly (I cleaned it up and added comments to make it more clear. Clock cycles are in parens.):

    in  r18, 0x3f   ; (1) Get interrupt flags into r18
    cli         ; (1) Disable interrupts (if they were enabled)
    ; *_srRegister |= _srMask; // fio_digitalWrite_HIGH
    ldd r30, Y+11   ; (2) Load _srRegister pointer into r30 / Z:0
    ldd r31, Y+12   ; (2) Load _srRegister pointer into r31 / Z:1
    ld  r24, Z      ; (2) Load _srRegister pointer's value into r24
    ldd r25, Y+13   ; (2) Load _srMask into r25
    or  r24, r25    ; (1) r24 = r24 OR r25
    st  Z, r24      ; (2) Store r24 into *_srRegister (this actually toggles the pin)
    ; *_srRegister &= ~_srMask; // fio_digitalWrite_LOW
    ldd r30, Y+11   ; (2) Load _srRegister pointer into r30 / Z:0 
    ldd r31, Y+12   ; (2) Load _srRegister pointer into r31 / Z:1
    ld  r25, Z      ; (2) Load _srRegister pointer's value into r24
    ldd r24, Y+13   ; (2) Load _srMask into r25
    com r24     ; (1) r24 = One's Complement of r24
    and r24, r25    ; (1) r24 = r24 AND r25
    st  Z, r24      ; (2) Store r24 into *_srRegister (this actually toggles the pin)
    out 0x3f, r18   ; (1) Restore interrupt flags (re-enables interrupts if they were enabled)

    Let's notice a few things here.

    1. There are 12 clock cycles between the pin going HIGH to it going LOW (i.e. the number of cycles between the ST operations)!

    2. There is a lot of redundancy here.

    My new/current (optimized) code is as follows:

    fio_digitalWrite_PULSE_HIGH_THEN_LOW(_srRegister, _srMask);

    which is a macro that breaks down to:

      uint8_t fio_a, fio_b;
      fio_a = *_srRegister;
      fio_b = fio_a | _srMask;
      *_srRegister = fio_b;
      *_srRegister = fio_a;

    Which generates the following assembly:

    in  r18, 0x3f   ; (1) Get interrupt flags into r18
    cli         ; (1) Disable interrupts (if they were enabled)
    ; fio_a = *_srRegister;
    ldd r30, Y+11   ; (2) Load _srRegister pointer into r30 / Z:0
    ldd r31, Y+12   ; (2) Load _srRegister pointer into r31 / Z:1
    ld  r25, Z      ; (2) Load _srRegister pointer's value into r25
    ldd r24, Y+13   ; (2) Load _srMask into r24
    ; fio_b = fio_a | _srMask;
    or  r24, r25    ; (1) r24 = r24 OR r25
    ; *_srRegister = fio_b;
    st  Z, r24      ; (2) Store r24 into *_srRegister (this actually toggles the pin)
    ; ??? (why does it need to re-load the Z register/pointer?)
    ldd r30, Y+11   ; (2) Load _srRegister pointer into r30 / Z:0 
    ldd r31, Y+12   ; (2) Load _srRegister pointer into r31 / Z:1
    ; *_srRegister = fio_b;
    st  Z, r25      ; (2) Store r25 into *_srRegister (this actually toggles the pin)
    out 0x3f, r18   ; (1) Restore interrupt flags (re-enables interrupts if they were enabled)

    Let's notice a few things again.

    1. There are now only 6 clock cycles between the pin going HIGH to it going LOW (i.e. the number of cycles between the ST operations).

    2. We have removed most of the redundancy.

    3. For some reason the compiler is still reloading the Z register/pointer. If we could get rid of this we would go down to 2 cycles between the pin going HIGH to LOW!

    @16MHz, 12 clock cycles is 0.75 uS and 6 clock cycles is 0.375 uS. 2 clock cycles would only be 0.125 uS.

    During this time period, we have to make sure that the capacitor doesn't discharge below the shift register's minimum input HIGH (V'IH in the datasheets) or charge above the maximum input LOW (V'IL in the datasheets).

    The slower the clock speed or the more clock cycles we waste, means we need a longer RC time constant.

    For me, 12 clock cycles didn't work reliably @16MHz, but 6 seems stable. 2 would be ideal but would probably require some assembly magic.

  9. Bill Perry

    I got it working on my m328 based Arduino. I had a silly error of using a 10k for one of my resistors. (Red and orange look very close under the flourscent lights I have in my office)

    The new stuff is Very nice, and very clear especially the pin 12 to pin 9 connection. Great creativity. I really like the new ASCII art and the MSB consistency. This strategy is quite a bit faster than the strategy I was using, which was an optimized version of Roman's.

    I never really looked that closely at the variation between different vendors 74HC595 and different. I had been using the NXP datasheet which shows min ViH of 3.15v when VCC is 5v with a typical value of 2.4v which is why it still works with 3.3v but the HCT would be a much better choice. The solution I was looking at moving to (but never finished) was some sort of dual driven solution that drove the output to the RC network both directions from gnd to full 5v (vcc) with either a 5v or 3.3v high output. But it is more components than the simple RC network directly driven by the MCU output pin.

    The loadSR() code posted earlier was simply the shiftout routine as an example of how to efficiently do a loop for MSB shifting out. The AVR generates slightly better/faster code if shifting the value and masking with a constant mask vs masking the value with a mask value that is changing/shifted. I also didn't want to depend on the fio code since it was really simple to just get the pointer and hit the register inline to control the pin (vs having to call a function) and I wanted to have control over the interrupt mask register. This method is portable across processors (AVR, pic/mics, and ARM) and is the fastest way to control pins other than using hard coded pin assignments on AVR which gets down to single clock.

    I drove the 595 and LCD with 5v vcc on both pic and AVR platforms. The MCU outputs were connected to the circuit (RC network) directly so on pic/mips you got 3.3v and on AVR you got 5v inputs going into the 595.

    The solution I was looking at moving to (but never finished) was some sort of dual direction driven output solution that drove the output to the RC network both directions from gnd to full 5v (vcc) with either a 5v or 3.3v high input signal. But it is more components than the simple RC network directly driven by the MCU output pin.

    The code I wrote didn't use the "AND" gate diode. It was basically a highly optimized version of Roman's schematic from this page: Roman's shift1 system my pins were the same as his:

    // Bit #0 - N/C - not connected (cannot be used - will always be 0)
    // Bit #1 - connects to NPN transistor for backlight (if used)
    // Bit #2 - connects to RS (Register Select) on the LCD
    // Bit #3 - connects to E on the LCD
    // Bit #4 - connects to d4 on the LCD
    // Bit #5 - connects to d5 on the LCD
    // Bit #6 - connects to d6 on the LCD
    // Bit #7 - connects to d7 on the LCD

    I reduced the RC times by alterning the caps and resistors to get the times down to 3, 5 and 10us. I used cycle accurate timing by not using the Arduino delayMicroSeconds() routines and instead using the pic internal timer register or some cycle accurate delay routines on AVR. Because E was driven by an SR output it means having to send 4 bytes to the SR instead of 2 for each byte to transfer the two nibbles to the LCD.

    Also, I found the 595 is horribly noisy when run in "non latching" mode. The latching cycle along with the shift on each clock can easily cause ground bounces which corrupts your interface. I had to add extra decoupling caps to fix this. Only latching the part on the full bytes should really help out. Although I'm seeing what looks like glitches on the E line going into the LCD on my logic analyzer - same as what I would occasionally see in my implementation. I'll have to look at this a bit more to see what is going on.

    For the average user, it might be worth updating or adding a backlight circuit that uses a simple 2n3904 instead of the mosfet as they are probably easier to get for most people. (I assume that is what was used on your strip board?) I used the 2n7000 on SR2W since it was easy to use to remove the flicker because of the toggling output bits because it didn't load down the RC network feeding the gate/base.

  10. Stephen Erisman author

    Wow. That tip about a faster way to do the shifting made another big difference!

    I'll post a code modification in a bit, but here are the new Benchmark II results:

    • Interface: SR1W

    • ByteXfer: 158uS

    • 16x2FPS: 185.93

    • Ftime: 5.38ms


  11. Stephen Erisman author

    As far as the backlight control...

    I actually use a simpler version that is just a 1k resistor going to the base of a NPN (i.e. pn2222a) with the collector tied to the Cathode, and the emitter tied to GND. The Anode is then connected through a resistor (100 or so ohms) to VCC.

    I kept the schematic from SR2W in the documentation because I didn't want to worry about why it was the way it was. I figured people might already be used to that one, and it shouldn't hurt to continue using it.

    But, I agree the NPN version is easier/cheaper.

    I can either update the schematic to include both versions, or swap it out for the NPN version. Thoughts?

  12. Stephen Erisman author

    By the way, you know there is a different 2-wire version that uses this same diode "gate" latching concept, right?

    Here is the schematic I was using (until I switched to 1-wire a few days ago that is):

    //                         74HC595     (VCC)
    //                       +----u----+     |
    // (LCD D7)------------1-|QB    VCC|-16--+
    // (LCD D6)------------2-|QC     QA|-15
    // (LCD D5)------------3-|QD    SER|-14--------------------------+---(Data PIN)
    // (LCD D4)------------4-|QE    /OE|-13--(GND)          1k       |
    // (BL Circuit)--------5-|QF    RCK|-12---------+--[ Resistor ]--+
    //                       |         |             \
    // (LCD RS)------------6-|QG    SCK|-11-----------)------------------(Clock PIN)
    //                     7-|QH   /CLR|-10--(VCC)   /
    //                  +--8-|GND   Q'H|--9---|<|---+
    //                  |    +---------+     diode  |
    //                  |                           |
    //                  |      0.1uF                |
    //                (GND)-----||----(VCC)         |
    //                                              |
    // (LCD EN)-------------------------------------+
    // (LCD RW)--(GND)

    Here is the stripboard layout, which is very similar to the 1-wire version mostly because this is what I started with ;)

    Because it makes use of latched outputs, you can get by with the NPN transistor driven backlight control on this one as well.

    I personally think this is a better design than what is in SR2W, but I've moved on to 1-wire now ;)

  13. Stephen Erisman author

    Hmmm... I have another (untested) idea that could potentially speed this up even more.

    I think we could take ((QH AND Q'H) NAND VCC) and control the /CLR line with it's output. We would then be able to get rid of the second RC and the diode AND "gate" and instead drive Latch and EN directly off of Q'H. We could then remove the entire clearSR() method which accounts for two 5 uS delays (per nibble). That would save another 20 uS (plus overhead) per byte send.

    It would work like this.

    1. Shift in a '1'.

    2. Shift in the other 7 bits.

    3. At this point the first '1' is shifted into Q'H. It then directly drives the Latch and EN high.

    4. The Latch outputs all the SR bits which makes sure QH is high as well.

    5. (QH AND Q'H) will be high now, which NAND'd with VCC will now be low.

    6. This will cause the shift register to be cleared.

    7. This will in turn bring Q'H high again, which will drive Latch and EN low again. (Would have to make sure EN doesn't go low to quickly for the LCD to pick up the data. Might need to add a delay in here using a capacitor?).

    8. (QH AND Q'H) will now be low again, which NAND'd with VCC will now be high again (and stay high until the cycle repeats).

    ... or at least that is the theory.

    So, we would start with a diode-resistor AND "gate" and then add a resistor-NPN-resistor NAND "gate" (similar to what is shown here: It would only take a few more parts than what we are already using.

    If this actually works, we could easily have the code support either option.

  14. Stephen Erisman author

    IT WORKS!!!

    Here is the schematic I have wired up on my breadboard:

    //                         74HC595     (VCC)
    //                       +----u----+     |          2.2nF
    // (LCD D7)------------1-|QB    VCC|-16--+      +----||----(GND)
    // (LCD D6)------------2-|QC     QA|-15         |
    // (LCD D5)------------3-|QD    SER|-14---------+--[ Resistor ]--+
    // (LCD D4)------------4-|QE    /OE|-13--(GND)         1.5k      |
    // (BL Circuit)--------5-|QF    RCK|-12---------+                |
    // (LCD RW)--(GND)       |         |             \               |
    // (LCD RS)------------6-|QG    SCK|-11-----------)--------------+----(Serial PIN)
    //                       |         |              |
    //             +-------7-|QH   /CLR|-10-----------)--+--[ Resistor ]--(VCC)
    //             |         |         |             /   |       1k
    //             |    +--8-|GND   Q'H|--9---------+    |
    //             |    |    +---------+            |    |
    //             |    |      0.1uF                |     \
    //             |  (GND)-----||----(VCC)         +------)--------------(LCD EN)
    //             |                                |     /
    //             |------|<|------+--[ Resistor ]--|    |
    //                   diode     |       1k            |
    //                             |                     +---||---(GND)
    //                             |                     |   2.2nF
    //                             |                     C
    //                             |                     |
    //                             +--[ Resistor ]---B-|> (NPN)
    //                                     1k            |
    //                                                   E
    //                                                   |
    //                                                 (GND)

    It works with the code as-is, or you can comment out the call to clearSR for faster use.

    Here are the new Benchmark II results with clearSR commented out:

    • Interface: SR1W

    • ByteXfer: 112uS

    • 16x2FPS: 263.35

    • Ftime: 3.80ms

    I'm not sure why it is 46uS faster instead of just 20 uS faster. There must have been a lot of overhead in clearSR().

  15. Bill Perry

    On the backlight circuit, I think it would be useful to include a NPN (2222a/2n3904) design since they are usually easier to acquire. For my test setup with your code, I actually used a 2n3904 with a 10k resistor to the base. I had orignally wanted to use this for the SR2W but because the outputs were not latched, the backlight circuit needed to have some persistance to keep it from flickering. The only thing that was simple that really worked was to switch to a mosfet since there is no load from the gate. This allowed the RC network to slow down the tranisitions during the shifts operation. Without it the backlight would flicker. The issue with using the same circuit on a normal NPN transistor is that the current flowing into the base alters the RC timing and didn't really work. The flickering was originally much worse. Now that the shift out operation is so fast it isn't that bad with out the RC network. But in fallback mode it is really bad. Maybe include both. When the outputs are latched, the resistor and capacitor going to the gate on the mosfet can be eliminated and the backlight output pin can connect directly to the gate. The mosfet will save a bit of power as well over the NPN transistor since there is no current needed to turn it on. (I used a 10k resistor on the NPN transistor vs a 1k to save a little bit of power) I hadn't seen the alternate 2 wire circuit you showed. I agree that it definitnely is better for a 595 or maybe even a 4094 but it wouldn't work for a 74ls164 The 164 is pretty wimpy.

  16. Bill Perry

    That new circuit is totally awesome. The transfer rate is now a little over 3 times faster than the original LiquidCrystal library that uses 6 pins.

    It would be nice to have the ability to select either one from the sketch. Maybe different constructor names for each? and then set a flag in the code to skip the clearSR() ?

    One other tiny bit of performance boost would be to select consecutive output bits on the SR for the LCD data pins. This would allow a shift on the data vs having to OR in each bit. Best would be D4-D7 on bits 0-3 as the compiler may be able to use the swap instruction to move the nibbles around.

    It would mean that you loose the ability to configure the output pins on the SR, but do you think anybody will really want/need to that? Just a thought....

  17. Stephen Erisman author

    Actually there is a 74HC164 version of this that is similar.

    Here is what I came up with a while ago (until I switched to the latched 74HC595 version):

    //                  +--------------------------------------------+--(Data PIN)
    //                  |                                            |
    //                  |      74HC164     (VCC)            1k       |
    //                  |    +----u----+     |      +--[ Resistor ]--+
    //                  +--1-|A     VCC|-14--+      |
    //                  |    |         |            +------------------(LCD EN)
    //                  +--2-|B      QH|-13         |
    // (LCD D7)------------3-|QA     QG|-12---|<|---+           (GND)--(LCD RW)
    // (LCD D6)------------4-|QB     QF|-11----------------------------(BL Circuit)
    // (LCD D5)------------5-|QC     QE|-10----------------------------(LCD RS)
    // (LCD D4)------------6-|QD   /CLR|--9--(VCC)
    //                  +--7-|GND   CLK|--8----------------------------(Clock PIN)
    //                  |    +---------+
    //                  |
    //                  |      0.1uF
    //                (GND)-----||----(VCC)

    And here is the stripboard layout that I briefly used:

    I gave this up a while ago because of the backlight issues. I never thought to use a mosfet with RC stabilizer. I might have to go back and try that out.

  18. Stephen Erisman author

    As far as turning the clear code on/off... I was thinking something along the same lines. Probably just a value that you pass into the constructor. 'SkipClear' or something like that, that defaults to false.

    For the consecutive output bits... I for one like them to be configurable, because it makes it a LOT easier on the stripboard layout. Other than to prove it can be done, I don't usually NEED the extra speed. We could probably have the code check if they are consecutive, however, and switch to the faster mode if they are. It should be a pretty easy check to do. We might even be able to do it as a compile time check using the preprocessor.

  19. Stephen Erisman author

    Hmmm... now that I look at it, my 74HC164 circuit is basically the same as what is already in the SR2W file.

    Why did you think it wouldn't work very well?

  20. Bill Perry

    What I meant was you couldn't do the "different 2-wire version that uses this same diode "gate" latching concept" becasue the 164 doesn't provide latching outputs or the serial output.

    Yeah, I pulled my hair out for a bit on the backlight flicker. Took me a while to lock on to using the mosfet with the RC network.

  21. Stephen Erisman author

    Earlier when I ran the benchmark on the new circuit, I missed another delay that can be skipped.

    Here are the new Benchmark II times with it commented out:

    • Interface: SR1W

    • ByteXfer: 103uS

    • 16x2FPS: 285.23

    • Ftime: 3.51ms

  22. Stephen Erisman author

    I added (most of) the new code/examples to this pull request.

    I still need to work on the performance optimization around bit order. That will have to come tomorrow, as I need some sleep right now.

  23. Stephen Erisman author

    I found some more optimizations that reduced the time for both circuit versions.

    I also removed an extra (unneeded) resistor from the HW_CLEAR version and moved the capacitor to the base of the NPN.

    Here are the new Benchmark II results:

    • Interface: SR1W HW_CLEAR / SW_CLEAR

    • ByteXfer: 92us / 135us

    • 16x2FPS: 318.59 / 217.28

    • Ftime: 3.14ms / 4.60ms

    The pull request has been updated.

  24. Bill Perry

    A couple of final comments before the code is committed.

    I noticed that in the schematics, the resistor in the lower RC network in the HW_CLEAR version uses a 1k resistor vs a 1.5 k resistor. Was this intentional?

    Also, with respect to constructors, would it be useful to have a "default" constructor or perhaps have a default circuit type so that sketches could use:

    LiquidCrystal_SR1W lcd(pin#);

    To use the SW_CLEAR hardware with POSITIVE backlight control. Maybe with a constructor of:

    LiquidCrystal_SR1W (uint8_t srdata, t_sr1w_circuitType circuitType = SW_CLEAR, t_backlighPol blpol = POSITIVE);

    I like it, but I'm wondering what others think?

    --- bill

    1. Stephen Erisman author

      The 1k resistor going to the base of the NPN was intentional. At the time I was testing this, I wanted a quicker transition for the clear operation. It might be fine at 1.5k still though. I haven't specifically tested it at that value.

      I could go either way on the constructor. The way it currently is, it prompts someone to check which version they are actually using. Otherwise they might end up using the SW_CLEAR code against the HW_CLEAR circuit (which should work) and either not notice, or wonder why it was slower than needed. Do you have any additional thoughts on this?

  25. Bill Perry

    I can go either way as well on the constructor. My only thought was that it might be simpler, since that simpler constructor would work on both versions of the hardware, but as you say people may end up running slower than they could be if they properly filled in the constructor. In the larger pricture, it might be better to force people to fill in the proper constructor rather than have defaults. I see many forum posts where people can't get their I2C interface to work because they are using the default I2C constructor rather than filling in all the information. It is a two edged sword. As the constructor gets more complex, users screw it up, and then when there are defaults to make certain cases easier, users use the defaults and then complain that it doesn't work. I don't see an obvious answer.

    I do have a little piece of code in LCDiSpeed that wil need updating for this updated SR1W constructor but its not really a big deal, and I have other udpates for it as well anyway.

    I guess I should give this latest SR1W stuff a try on the chipkit. I'd be surprised if it works "as is", since the RC timing changes because of the 3v outputs. However, it should be easier to correct the delays since there is only 1 delay involved instead of 2. I'm guessing that the code can simply alter delay times at compile time based on the MIPS processor. I'll try it and report back.

  26. Francisco Malpartida repo owner

    Sorry folks, I haven't been too active on this pull request.

    Comments and suggestions: 1. I would be inclined to have a default constructor for one of the two circuits but if you look at the comments and questions that people make you end up dropping the idea. I think that we could maintain the constructor as is and put in the project's wiki the circuit and how to initialize it. Not that we will not get questions from not reading the documentation as people have a tendency of plugging the circuit and asking why it doesn't work. 2. I think that the whole circuit will work just as good with a larger resistor but again in the ASCII circuit it can be kept. I will build it up and try to get it going with a FET too. 3. We need to get an example or two in the repo.

    I will draw up an schematic for the wiki and send it over for you folks to review in the coming days.

    I am still trying to figure out how to merge the pull request into the main branch. Have any of you done it before?

    I like this contribution and would like to get 1.3 packaged within the coming weeks so that we make this work available to the community.

    1. Stephen Erisman author

      There should be a Merge button next to the Edit and Decline buttons.

      More info at the bottom of this page:

      But it looks like there might be some other issue that has to be cleared up first. At least on my screen there is a message: "Bitbucket detected multiple heads on the destination branch. Merge all the heads on the destination branch. Then, try to merge the request again.". I'm not sure if that needs to be fixed first. I don't have a merge button on my screen but that could be just because I don't have write permissions for the repository. Worst case you should be able to manually pull the changes to your local system and then check them in (instructions are in the above link).

  27. Bill Perry

    A quick bit of feedback on the chipkit - It caused me to take a closer look at the code and discovered some issues.

    As expected it didn't work "as is". After quite a bit of fooling around with the logic analyzer and pulling my hair out as to why the pulses were not working, I found an issue in the fio_digitalWrite_PULSE_XXX macros. fio masks and registers must be declared using the fio types rather than uint8_t types because not all processors use 8 bit registers for their i/o registers. So the proper declarations needs to be:

    #define fio_digitalWrite_PULSE_LOW_THEN_HIGH(reg,bit)   ATOMIC_BLOCK(ATOMIC_RESTORESTATE) {fio_bit fio_a, fio_b; fio_a = *reg; fio_b = fio_a & ~bit; *reg = fio_b; *reg = fio_a; }
    #define fio_digitalWrite_PULSE_HIGH_THEN_LOW(reg,bit)   ATOMIC_BLOCK(ATOMIC_RESTORESTATE) { fio_bit  fio_a, fio_b; fio_a = *reg; fio_b = fio_a | bit; *reg = fio_b; *reg = fio_a; }

    With that fix the pin is now toggling "better" but code is still not working. I'll need to look closer at the timing. I will say that I'm not really a fan of using digitalWrite_PULSE_xxx macros. The reason is that for applications like the code in clear_SR() they are not very efficient. This is because they are maintaining interrupt enable state for each operation. In something like a fast loop, like what is in clear_SR() it is much better to set the atomic state outside the loop then use the faster i/o calls which don't mess with the ISR enable state. The loop is plenty fast enought that the time interrupts are masked shouldn't create any issues.

    Also, while the new fio_digitalWrite_PULSE_XX macros are ensuring atomicity, none of the other fio_digitalWrite_XXX macros do this.

    The current SR1W code has some issues with respect to this as it is currently using fio_digitalWrite_LOW() and fio_digitalWrite_HIGH() without ensuring atomicity. This will cause issues (clobbering i/o registers) on AVR based processors when ISRs (like the servo library) use digitalWrite() calls inside their ISR. The way to fix this is to mask interrutps or use the ATOMIC wrappers around them. You can see an example of this in the sample loadSR() function example I showed earlier. In my case I used interrupts() and noInterrupts() and did the ORed/ANDed assignments directly, the code could use the ATOMIC wrappers and fio_digitalWrite_HIGH() and fio_digitalWrite_LOW() instead.

    My personal feeling is that It seems odd and inconsistent to have the new PULSE macros handling atomicity vs none of the others. The others intionally didn't do it since it allows faster code to do the atomicity elsewhere. My suggestion would be to actually eliminate the new PULSE macros and just use the existing primitives. Obviously the code needs some other updates to ensure atomicity. While the code will take a small hit for the added overhead of ensuring atomicity, there is no other way to ensure atomicity on the AVR when using runtime variables to control which Arduino pins are used. My bet is that it will actually be a wash or perhaps even faster for in SW_CLEAR mode as moving the atmoicity outside the loop in clearSR() should make the overall code faster.

    1. Stephen Erisman author

      Bill, I will have to look into this further.

      Moving the ATOMIC wrappers out of the PULSE macros and into the loadSR and/or clearSR shouldn't be an issue as long as we think it is reasonable to have interrupts disabled for that long of a time. Right now they are disabled for the minimum amount of time. Also, if we remove the ATOMIC wrapper out of the PULSE macro, the macro won't be reliable if it is used in a code block that doesn't first disable interrupts. This puts the responsibility on the caller to know about the details of the macro which seems messy to me. (Kind of like how I introduced a bug in this code by using the HIGH/LOW macros outside of an ATOMIC wrapper without realizing the implications.)

      Also, eliminating the PULSE macros and going back to the HIGH/LOW macros isn't really an option. It doubles the time for the quick transitions and didn't work reliably for me (see my comment above that starts with "Here is an example of how careful we have to be with the first timing constraint I talked about earlier...").

      I suppose I could eliminate the PULSE macros and move their logic into the loadSR and clearSR methods directly, but this would make those methods more complex and harder to read. How would you feel about me moving the PULSE macros out of FastIO and make them local macros instead? If they were only going to be used by SR1W I would also feel more comfortable moving the ATOMIC wrappers out of the macros and putting them directly into loadSR/clearSR. The big downside being that interrupts would be disabled for the whole duration of the method.


      1. Bill Perry

        The interrupt blocking in the clearSR() shouldn't be very long even if it's around the full loop. In the normal loadSR() it can be 10's of microseconds which still probably is acceptable. None of the other macros handle atomicity. It was left to the users to handle which makes the code more efficient. I think originally they were intended to be helpers for the other fast i/o code.

        HIGH/LOW shoudn't be slower than the PULSE that seems odd to me. My guess would be that it wasn't reliable because every now and then they were interrupted because the "pulse" using HIGH/LOW macros was not atomic rather then them not being fast enough.

        I'm not sure I agree that adding the ATOMIC stuff into the code makes it that more complex or harder to read. You can see what it would look like by looking at the loadSR() routine I posted earlier. The noInterrupts()/interrupts() would be replaced with an ATOMIC block and then the |= would be the HIGH macro and the &= would be the LOW macro.

        But in the end it is more a matter of taste. I think if local SR1W macros can improve readiability, then that's ok. You may also need macro wrappers aound the fast i/o LOW and HIGH macros so then you may end up with two sets of LOW and HIGH macros 1 with atomicty and one without depending on how the atomicity is handled within the loops.

        1. Stephen Erisman author

          Study the actual compiled code. Using the default settings for the Arduino 1.0.5 IDE, the HIGH/LOW code is definitely spending twice as long (12 cycles) with the pin toggled than the PULSE code (6 cycles). Maybe there is some optimization option that could be enabled, but that's the way it currently compiles. This has nothing to do with the ATOMIC stuff as in either case the calls were inside an ATOMIC wrapper.

          But, I will try and come up with something that address all these issues/concerns later tonight.

  28. Francisco Malpartida repo owner

    I am OK either way. I don't think that putting the ATOMIC operation would make the code more complex since they would end up like a block in the code giving a clear very of what is in the critical region.

    In any case, and as Bill suggests, it's a question of taste. The rest of the macros don't wrap the critical region, leaving to the author the management of critical regions.

    My call here would be to be consistent with the rest of the code.

    1. Stephen Erisman author

      Thanks for the feedback guys.

      I will try and come up with something that addresses these issues/concerns later tonight (I'm at work right now).

      At the very least, I need to fix the HIGH/LOW calls to make sure there is an ATOMIC wrapper around them.

  29. Stephen Erisman author

    I updated the code to fix the issue with using fio_digitalWrite_LOW() and fio_digitalWrite_HIGH() without ensuring atomicity.

    I also removed the PULSE macros from fastio and hard-coded the logic in SR1W. Please verify that this code is clean enough and easy enough to read.

    I found a few more performance optimizations (mostly related to 'caching' class fields as local variables) that really help out the SW_CLEAR version. This approach also brings down the number of cycles spent in a quick transition from 6 down to 2 (probably as good as it gets) without requiring hand written assembly.

    Here are the current Benchmark II results:

    • Interface: SR1W HW_CLEAR / SW_CLEAR

    • ByteXfer: 94us / 116us

    • 16x2FPS: 311.41 / 252.83

    • Ftime: 3.21ms / 3.96ms

    The pull request has been updated.

  30. Bill Perry

    One thing that I just noticed is that there is no recommendations for the NPN transistor and the diode in the schematics. I know we wouldn't have any issues selecting parts, and there is reference to a 1n4148 diode in the text above but I think it would be helpful to put it in the actually ASCII schematic to make it fully self contained.

    1. Stephen Erisman author


      It looks like someone added the benchmark results to the wiki, but used the first set of results that have since been refactored and optimized many times. There are also now two versions of the hardware that have different benchmark results. Should both be added to the wiki? I can get current benchmark results in a few days if needed (currently using my arduinos for another project).

      1. Francisco Malpartida repo owner

        Yes, that was me. I got very old results. I have updated Benchmark II and update the performance ratio of the SR too. I will try to get a schematic loaded over the coming days.

        I am still trying to figure out how to kill one of the default branches that picasso created. I think it is the one causing the problem.