Troubleshooting intermittent problems

Discussion:

peppergrower

2008-08-01 00:10:05 UTC

I'm encountering an intermittent problem in a VI that monitors the state of a piece of equipment. The code is mildly complex, and it was also written by someone else, which complicates things a bit more. I don't get an error, per se; the program just seems to lose communication with the equipment until I notice and hit a button to tell it to check for the equipment again--at which point it recovers just fine and continues to work.

I'm familiar with most of the normal LabVIEW troubleshooting methods (highlighting execution, probes, breakpoints, etc.), but I don't know of a good way to troubleshoot problems that occur silently, once every hour or two. It's possible it's a GPIB error, but I usually get error codes for those, so I think it's something about the previous programmer's coding. What are some good general methods for tracking down silent, intermittent errors?

Thanks,
Michael

johnsold

2008-08-01 01:40:17 UTC

Michael,

Intermittent problems are always the tough ones.

You did not describe the program architecture. If you have a state machine (which would likely be a good choice for the type of program you described), you can build a little state tracer VI. It is somewhat like an Action Engine: a while loop with uninitialized shift registers. It takes the current value of the state selector enum as an input. If the current value is different from the previous value, it appends the value to an array.

While the troublesome program is running, I launch an independent copy of the tracer VI. Whenever I want to check the sequence of states, I run the tracer VI and look at the array. It has been quite helpful in some projects.

I have one version which monitors two parameters simultaneously, the state and a function enum. In thinking about it, it might be more useful to limit the number of past states recorded, like a circular buffer. After hundreds or thousands of state changes, searching for the problem area is tedious.

Another idea: a watchdog. Set a little VI running as an independent loop. Have it watch for some repeatable indicator of activity. If the activity fails for a specified time, then simulate hitting the button which resets things.

Lynn
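
For readers who prefer text code, the tracer and watchdog ideas above can be sketched roughly as follows. This is a Python analogy, not LabVIEW, and not anyone's actual VI: the names `StateTracer`, `Watchdog`, `kick`, and `check` are made up for illustration, and the bounded `deque` stands in for the circular buffer Lynn mentions.

```python
from collections import deque
import time

_NO_STATE = object()  # sentinel: nothing recorded yet


class StateTracer:
    """Keeps the most recent state changes; oldest entries drop off."""

    def __init__(self, max_entries=100):
        self._history = deque(maxlen=max_entries)  # acts as a circular buffer
        self._last = _NO_STATE

    def record(self, state):
        # Only append when the state differs from the previous one.
        if state != self._last:
            self._history.append((time.time(), state))
            self._last = state

    def states(self):
        return [state for _, state in self._history]


class Watchdog:
    """Calls on_timeout if no activity was reported for timeout_s seconds."""

    def __init__(self, timeout_s, on_timeout, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._on_timeout = on_timeout
        self._clock = clock
        self._last_activity = clock()

    def kick(self):
        # The monitored code calls this whenever it shows signs of life.
        self._last_activity = self._clock()

    def check(self):
        # Called periodically from an independent loop.
        if self._clock() - self._last_activity > self._timeout_s:
            self._on_timeout()  # e.g. simulate pressing the reset button
            self._last_activity = self._clock()
            return True
        return False
```

In use, the monitored loop would call `record()` with its current state on every pass and `kick()` after each successful read, while a separate loop calls `check()` on a timer.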

peppergrower

2008-08-01 02:10:07 UTC

Lynn,

Very good suggestions. Thanks a bunch for your input. You're right; I didn't really describe the program architecture, and I apologize. Along with monitoring the equipment, the program also allows me to change various settings on the equipment. The previous designer has several event structures in parallel (in separate while loops, of course). One of them checks various equipment indicators on a timeout event. My first suspicion was that something was causing it to never time out, but the only other events it handles are (a) program end, and (b) the check to find the equipment (on a button push). I have a couple of other leads, but I wasn't sure how to follow them up when the error is so infrequent.

I like your idea of a tracer program. I think it should give me the tools I need, assuming I can figure out which events and objects to keep track of. My first thought is to capture the time each event fires in each of the event structures; possibly I'll see a pattern. I also have an indicator whose state I want to log, which will be trivial. I think I know roughly what ends up happening when the problem occurs, so I can check for that condition and log it as well. And if I can't track the problem down, a watchdog is a nice workaround.

Does anyone have a good way to automatically detect which event in an event structure just executed, or will I need to feed a string constant out of each and record them that way?

Michael

johnsold

2008-08-01 12:40:25 UTC

Michael,

I do not know of any way of tracking events other than what you have described, but I will admit that I have not had the need to check that.

I would take a close look at the multiple event structures. It is generally recommended to use only one. Look for multiple structures responding to the same event or to related events on the same control (like Value Changed and Mouse Down). Is the front panel locked by any of the events? Do any event cases have code in them which might take a long time to execute or which might "interfere" in some way with one of the other structures?

Lynn

Ben

2008-08-01 13:40:13 UTC

Q1
Did this application ever work correctly?
If not, please post the code so we can review it. Race conditions are the most frequent cause of intermittent problems with a LV app.
Q2
If it used to work and now does not, I'd suspect hardware. Run NI-Spy to track instrument traffic.
Q3
What version of LV? LV 7.0 (or thereabouts) had two bugs that would play off each other and lead to lock-ups.
Ben

peppergrower

2008-08-03 08:10:04 UTC

Lynn,

Thanks; I'll be checking for those things. Unfortunately, I realized that tracking this bug down might be more involved than I had thought. This program is used to control an electromagnet, and is sometimes used as a standalone program. However, it is also part of a comprehensive program that controls the rest of the instruments in the lab, and which does its own occasional communication with the magnet (to check if it's turned on, etc.). It's possible that the overall program is somehow interfering when it performs its own status checks. I'll have to look at that as well.

Ben,

I don't think it ever did work properly. Or at the very least, this bug has been around for quite some time. Thanks for asking to see the code; I'd appreciate it if anyone feels like taking a look. I created a source distribution (LV 8.2) and zipped it, which I've attached; the control program itself is Magnet v3.0.vi, and the event that does the routine status checks is in the timeout case of the upper-left event structure (which is in frame 1 of the event structure). (I didn't include the comprehensive program mentioned above, since that one is huge.) I apologize if the code is messy or unclear; I've made a few tweaks to fix bugs, but nearly all of it was written by the programmer before me. It's also possible hardware is involved; thanks for the idea of NI-Spy.

A few more general notes: I talked with one of the other people in the lab who uses the program quite a bit, and he pointed out that there are at least two bugs. In one, the indicators for levels of liquid nitrogen and helium both drop to zero, but other indicators on the front panel continue to work properly. In the other, all the indicators drop to zero and the temperature graph stops updating. In both cases clicking the "check for magnet" button reestablishes communication immediately.

Magnet control program1.zip:
http://forums.ni.com/attachments/ni/170/346149/1/Magnet control program1.zip

Albert Geven

2008-08-04 04:10:16 UTC

Michael,

Put each of your instruments into a separate action engine (search for the article from Ben), and you can be sure that no race conditions on an instrument can happen. Except race conditions on the bus...
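
A rough text-code analogy for the point above: an action engine serializes every caller through one piece of code, which in Python can be approximated with a lock around each write/read pair. This is only a sketch under stated assumptions. `InstrumentGate` and `FakeTransport` are hypothetical names, and the fake transport merely stands in for a real bus driver.

```python
import threading


class FakeTransport:
    """Stand-in for a real bus driver (hypothetical); replies to the last command."""

    def __init__(self):
        self._pending = None

    def write(self, command):
        self._pending = command

    def read(self):
        reply, self._pending = self._pending, None
        return f"reply to {reply}"


class InstrumentGate:
    """Lets only one caller at a time do a write followed by a read."""

    def __init__(self, transport):
        self._transport = transport
        self._lock = threading.Lock()

    def query(self, command):
        # Holding the lock across write+read keeps another caller from
        # writing in between and stealing or corrupting the reply.
        with self._lock:
            self._transport.write(command)
            return self._transport.read()
```

One gate per instrument protects that instrument; as Albert notes, instruments sharing a bus would need a shared gate (or lock) at the bus level as well.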

johnsold

2008-08-04 14:40:05 UTC

Michael,

I see one thing which I find troublesome: many of the event cases have code inside which can run for long times. In fact the entire program, except for some initialization, is inside event cases.

The rule I use is that no code should execute inside an event case which could take longer than the minimum time between two events. I typically take that to be less than tens of milliseconds. Nothing which does any I/O, file operations, OS interactions, user dialogs, time delays, or other time-consuming tasks is allowed inside an event case.

Change the program to a state machine producer/consumer pattern, with one event structure handling all user interaction and the second loop performing all other tasks. This will also enable elimination of the local and global variables.

I suspect that race conditions between local/global variables, or a blocked event structure, or both are the cause of the problem.

Lynn
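
The producer/consumer split described above might look something like this in text code. It is a minimal Python sketch, not the LabVIEW template: the command strings and the `consumer` and `run_demo` names are invented for illustration, and a thread plus a `queue.Queue` stand in for the two loops and the LabVIEW queue.

```python
import queue
import threading


def consumer(commands, results):
    """Second loop: pulls commands off the queue and does the slow work."""
    while True:
        command = commands.get()  # blocks until the producer sends something
        if command == "stop":
            break
        # Instrument I/O, file access, delays, etc. would go here.
        results.append(f"handled {command}")


def run_demo():
    commands = queue.Queue()
    results = []
    worker = threading.Thread(target=consumer, args=(commands, results))
    worker.start()
    # Producer side: the UI handler only enqueues and returns immediately,
    # so it is never tied up by slow work.
    for command in ["check magnet", "read levels", "stop"]:
        commands.put(command)
    worker.join()
    return results
```

Because data travels through the queue rather than through shared variables, the two loops never read and write the same value at the same time, which is the sense in which the pattern removes the need for locals and globals.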

peppergrower

2008-08-04 18:40:18 UTC

Albert and Lynn,

Thanks for the suggestions. It sounds like the two ideas are similar in principle. Lynn, I'm not familiar with producer/consumer architecture, but I did a little searching and found this discussion: http://forums.ni.com/ni/board/message?board.id=170&message.id=236725. I downloaded the example from Kalin, and after looking at it I created a model that puts the user input in the producer loop and the data acquisition in the consumer loop (attached, LV 8.2). Do I have the right idea?

Forgive my ignorance, but I'm also not sure how this particular setup would help avoid a blocked acquisition. If one of the tasks halts (or takes a very long time), then everything else will still wait for that state of the state machine to finish executing. It seems like it could still have similar problems to the current design. I do like the architecture, though, for making sure things happen sequentially and helping to avoid race conditions. How would it help avoid local and global variables?

Sample producer-consumer architecture.vi:
http://forums.ni.com/attachments/ni/170/346430/1/Sample producer-consumer architecture.vi

peppergrower

2008-08-07 22:40:05 UTC

Does anyone else have suggestions, especially on using a producer/consumer architecture to solve some of the problems with the current design?

Thanks,
Michael

johnsold

2008-08-08 13:40:08 UTC

Michael,

In your Sample program, changing the Mechanical Action on the buttons to Latch when Released (or Latch when Pressed) eliminates the need for the property nodes to reset the switches.

The examples which come with LV are at File >> New... >> VI >> From Templates >> Frameworks >> Design Patterns.

Blocked acquisition: If you have multiple instruments on one GPIB bus, you have the possibility of conflicts. Use timeouts to limit how long you wait for a response. If you use a single VI to do the GPIB reads and writes and call it from various places for the different instruments, you can avoid having one try to read just after the other wrote. I think this may have been where Albert was suggesting the use of an Action Engine.

One way to avoid having a state machine lock up in a time-consuming state is to move the really time-consuming part to a parallel loop and send commands and data between the two loops via queues. For example, suppose it takes two seconds to change the magnet current. Then in the Change Magnet state, send a "Change Current to X" command via the command queue and set a Waiting for Magnet Change flag. The next state is still Change Magnet. If the event structure has sent a new command ("Lunch Break"), then the state machine can respond immediately to that. Otherwise it returns to the Change Magnet state, where it checks the Wait flag and the response queue to see if the change has finished. It can repeat this state many times and then go on to the next state after the change is complete. It may only spend microseconds in the state before checking the event command queue, but seconds or hours before it goes to some other state.

Lynn
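
The Change Magnet example above can be sketched in text code as a single non-blocking pass of the state machine. This is a Python illustration under assumptions, not the actual program: the `step` function, the queue names, and the tuple format of the request are all hypothetical, and the parallel loop that would service `requests` and post to `responses` is left out.

```python
import queue


def step(state, waiting, commands, requests, responses, target):
    """Run one pass of the state machine; returns (next_state, waiting)."""
    # User commands get first look on every pass, so the loop stays responsive.
    try:
        return commands.get_nowait(), waiting
    except queue.Empty:
        pass

    if state == "Change Magnet":
        if not waiting:
            # Hand the slow job to the parallel loop and set the wait flag.
            requests.put(("set current", target))
            return "Change Magnet", True
        try:
            responses.get_nowait()  # has the parallel loop reported back?
            return "Idle", False
        except queue.Empty:
            return "Change Magnet", True  # not done yet; come back next pass

    return state, waiting
```

Each call returns almost immediately, so the machine can sit in Change Magnet for seconds or hours overall while still checking the command queue on every pass.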

peppergrower

2008-08-10 07:10:04 UTC

Lynn,

Thanks for the tips. Those make sense, and I'll see about implementing them. I've been debating rewriting the code, since it does mostly work, but it would be satisfying to (a) fix this bug, (b) implement a new-to-me design, and (c) have cleaner, more maintainable code. I appreciate your help.

Michael
