Nightly 92: ASI 6200 sequence failed after 14 successful exposures

Issue #881 invalid
Ruediger created an issue

Hello Stefan,

I am just testing nightly build 92, and the advanced sequence failed after 14 successful exposures. I am not sure how to interpret the log or what happened. I have attached it. The trouble started at line 797:

2021-06-12T23:40:48.6725|ERROR|SequenceItem.cs|Run|141|Category: Camera, Item: TakeExposure, ExposureTime 90, Gain 0, Offset 50, ImageType LIGHT, Binning 2x2 - Object reference not set to an instance of an object. at NINA.Sequencer.SequenceItem.Imaging.TakeExposure.<Execute>d__42.MoveNext()
--- End of stack trace from previous location where exception was thrown ---

Many thanks in advance!
Rüdiger

PS: I have restarted the sequence and will see what happens…

Comments (13)

  1. Dale Ghent

    line 795:

    2021-06-12T23:40:48.6695|ERROR|ASICamera.cs|DownloadExposure|409|ASI: Camera reported unsuccessful exposure: ASI_EXP_FAILED
    

    Something happened with the camera when triggering the exposure, with the ZWO SDK reporting an “exposure failed” error.

    All subsequent attempts to trigger an exposure also failed in the ZWO SDK, with it reporting a general, non-specific error condition:

    2021-06-12T23:40:48.6795|ERROR|ImagingVM.cs|CaptureImage|253|Error 'ASI_ERROR_GENERAL_ERROR' from call to 
    

  2. Dale Ghent

    The root cause is that the camera went out to lunch for some reason - SDK bug, connection issue, something of that nature well below and outside the visibility of NINA.

    What is clear though is that we need that error handling strategy for low-level stuff like this, so it can be properly consolidated and made available for something like a notification plugin to process, and/or for the sequencer to suspend operations or take some other action.
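
    Very roughly, something along these lines (a minimal sketch only; none of these type names are actual NINA APIs): every driver funnels its SDK failures into one consolidated event, and a notification plugin or the sequencer can subscribe to it without any SDK-specific knowledge.

    // Hypothetical sketch - these types do not exist in NINA today.
    using System;

    // One consolidated record for any low-level device failure.
    public record DeviceErrorEvent(
        string DeviceCategory,   // e.g. "Camera"
        string SdkErrorCode,     // e.g. "ASI_EXP_FAILED"
        string Message,
        DateTimeOffset Timestamp);

    // Central broker: drivers publish, plugins and the sequencer subscribe.
    public static class DeviceErrorBroker {
        public static event Action<DeviceErrorEvent>? ErrorRaised;
        public static void Publish(DeviceErrorEvent e) => ErrorRaised?.Invoke(e);
    }

    public static class Demo {
        public static void Main() {
            // A notification plugin could subscribe like this...
            DeviceErrorBroker.ErrorRaised += e =>
                Console.WriteLine($"[notify] {e.Timestamp:u} {e.DeviceCategory}: {e.SdkErrorCode} - {e.Message}");

            // ...and a camera driver would publish when its SDK reports a failure.
            DeviceErrorBroker.Publish(new DeviceErrorEvent(
                "Camera", "ASI_EXP_FAILED",
                "Camera reported unsuccessful exposure", DateTimeOffset.UtcNow));
        }
    }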

  3. Ruediger reporter

    Hi Dale,
    many thanks for the quick reply. OK, so I have read it correctly: there is no clue as to what this “general failure” could have been. It is unspecific because the ZWO SDK gives no detailed error. Any idea how to drill down to the root cause? After a reconnect it has been working flawlessly for 27 exposures so far.

    Thanks,
    Rüdiger

    PS: Our messages were overlapping.

  4. Dale Ghent

    Finding the root cause for your problem is something you’ll have to engage with ZWO on. We cannot provide support for the SDK-camera interaction; only advice based on our own experiences. Often, errors like this require the camera to be power cycled (with USB disconnection) in order to clear some bad state in the camera firmware. Sometimes you get off easy with a reconnect as in this case. It really depends. The SDK provides us with no detailed information about issues and we have to live with its vague “general error” kind of reporting. The actual cause can be anyone’s guess.

  5. Ruediger reporter

    A plugin would be great. Some kind of error handling would be excellent. Looking forward to it.

    Just to explain what happened; maybe this also helps with the plugin’s design:
    The error caused the sequence to jump immediately to the end. At the end I have a park, a warm-up, and a power-off for all components (except the mount). The camera was powered off before it could warm up (it was unresponsive anyway), but without the power-off it would have kept its set temperature and the sensor front-glass heating. Because of the complete shutdown, the restart was much more complex than it would have been had the sequence simply stopped when the error occurred, letting me reconnect the camera and resume.

    Thanks Rüdiger

    PS (since our messages overlapped again):
    Totally agreed: I do not expect the NINA team to troubleshoot any ZWO flaws. Never did. We only need to get into a safe harbor after an unexpected hiccup. Nothing more.

  6. Ruediger reporter

    Issue closed. The root cause cannot be identified and lies within the ZWO SDK. The issue is not NINA related, hence it is closed as invalid.

  7. Dale Ghent

    It’s ultimately up to Stefan, but with a proper error handling mechanism, sequence operations would be suspended and the operator would be notified through some channel, probably implemented by a plugin (GNS, email, SMS via an SMS gateway service, or another alert system such as the ones commonly used in the IT world).

    Now you might see why I advocate for humans to address error conditions (once properly notified) and to orchestrate the recovery 🙂 Pressing on and automatically trying to “fix” things can make a bad situation worse and complicate the recovery process. In this case, falling through to the end-of-sequence section is just a result of the current lack of more involved error handling plumbing.
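
    To make that concrete, here is a rough sketch of a plugin-side handler (the interfaces are hypothetical, not real NINA or GNS plugin APIs): on a fatal device error it suspends the sequence, notifies the operator, and leaves the recovery decision to a human.

    // Hypothetical sketch - not actual NINA or GNS plugin APIs.
    using System;
    using System.Threading.Tasks;

    public interface ISequenceControl {
        // Pause the running sequence instead of falling through to end-of-sequence.
        Task SuspendAsync(string reason);
    }

    public interface IOperatorNotifier {
        // Deliver an alert via whatever channel the plugin implements (GNS, email, SMS, ...).
        Task NotifyAsync(string subject, string body);
    }

    // Plugin-style handler: suspend, notify, then wait for the human.
    public class SuspendAndNotifyHandler {
        private readonly ISequenceControl _sequence;
        private readonly IOperatorNotifier _notifier;

        public SuspendAndNotifyHandler(ISequenceControl sequence, IOperatorNotifier notifier) {
            _sequence = sequence;
            _notifier = notifier;
        }

        public async Task OnFatalDeviceErrorAsync(string device, string errorCode) {
            await _sequence.SuspendAsync($"Fatal error on {device}: {errorCode}");
            await _notifier.NotifyAsync(
                $"Sequence suspended ({device})",
                $"Suspended after {errorCode}. Inspect the equipment, then resume or abort manually.");
        }
    }

    // Minimal console stand-ins so the sketch runs end to end.
    public class ConsoleSequence : ISequenceControl {
        public Task SuspendAsync(string reason) { Console.WriteLine($"[sequence] suspended: {reason}"); return Task.CompletedTask; }
    }

    public class ConsoleNotifier : IOperatorNotifier {
        public Task NotifyAsync(string subject, string body) { Console.WriteLine($"[notify] {subject}: {body}"); return Task.CompletedTask; }
    }

    public static class Program {
        public static async Task Main() {
            var handler = new SuspendAndNotifyHandler(new ConsoleSequence(), new ConsoleNotifier());
            await handler.OnFatalDeviceErrorAsync("Camera", "ASI_EXP_FAILED");
        }
    }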

  8. Ruediger reporter

    I totally accept your valid argument, but a simple hit of the reconnect button was also the solution in this case 😃 😃 😃 😃 😃

    There was nothing else to do on my side (ZWO won’t come here within the next 10 minutes and fix it 😂). After the first failure, a clean disconnect and reconnect would have solved all my problems immediately, instead of ending up in an inconsistent system state.

    Nevertheless, any defined state after an error is OK. I think we can agree on that for sure. 🙃

  9. Dale Ghent

    The problem is that we never know what will fix an issue and can never assume that any automatic course of action will be harmless if it doesn’t work. This fix could be a reconnect, or from other experiences it could involve the person going out to their camera and physically disconnecting the USB in order to deenergize the logic board and reset the camera’s internal firmware. I’ve even seen cases where the reconnect works, but the images that are downloaded afterwards have banding or are corrupted. From a programmatic standpoint, that can look like a recovery, but it isn’t. You cannot take one simple example and present it as a proof; the problems can often be more complex to deal with, hence the need for a human to carefully put things back together as you found out.

  10. Ruediger reporter

    I understand the given examples, but it does not actually change the result: if I am asleep or the gear is remote and it fails, the data is lost anyway. If I reconnect the device automatically and, to stay with your example, there is banding, the result is the same: lost data. But at least I had a chance to recover with an automatic reconnect. If I assume it could introduce any risk, I do not reconnect. That's it.

    In this example, as with a mount disconnect, the vendor will not help. They leave me alone with the problem, so I have to deal with workarounds. I have no chance to tackle the root cause. Like all companies, they never, ever have any issues. And as always in life, if you cannot solve a problem, you have to live with it and mitigate the impact. That’s all. That’s the way the cookie crumbles. 😌 For this type of error I can only reconnect and hope for the best.

    Fun fact: NASA does the same with their probes when they fail. One of their default actions on fatal errors is to reset / restart the failed sub-unit. They live with the same problem and very often fix it this way.

    And even when the ZWO works fine, I just got “Starlinked” at this very moment. Nice parallel lines. 😡 Definitely time for another beer… 🍺

    Cheers!

  11. Dale Ghent

    You’re always assuming a rosy outcome. Our wider range of experiences says otherwise. I haven’t yet mentioned hardware getting so f*cked that the SDK crashed upon an attempt to reconnect, which brought the entire application down with it. Luckily the user was right there, of course, and was able to bring everything back online. If the reconnect had been automatic and unattended, it would have been a disaster. The application crash caused the mount’s ASCOM driver to quit because no other app was connected to it. This meant that the mount would have continued to track until it ran the camera into the pier, because the ASCOM driver was no longer around to enforce meridian limits. Not a great outcome.

    NASA’s systems are designed to be reset because the entire system is designed to operate that way. The systems we deal with are not; they are assemblies of random components of varying quality and often so-so standard compliance. NINA tries its best to glue it all together, but it depends on everything functioning properly to do this. Many error states that NINA can observe are vague and varied and cannot be relied on to properly diagnose a problem in an automatic fashion, or to determine the severity and likely outcome. Because of this, it’s always best to involve the operator when it comes to a fatal condition, even if the fix just happens to be a reconnect. At least the operator has a chance to ensure that nothing else is running away.
