fusiongyro / dupfinder

A simple program to find duplicate files.

$ hg clone http://bitbucket.org/fusiongyro/dupfinder/

Notes about Go

Impressions

First of all, after reading the language specification and the commentary in both the FAQ and the Language Design FAQ I have to say that the language does meet many of the designers' stated goals.

If you take a look at (p Path) Iter() in pathiterator.go you'll see I didn't have any trouble converting a callback interface into a generator, even returning the channel immediately while another process goes about filling it. This shouldn't be very hard, and it isn't. Go has more in common with C than with Python, but Python has obviously been a source of inspiration.

In (path ChecksumIterator) Iter() you see something which is hard to explain verbally. Essentially, it provides a generator façade for what is, underneath the surface, a bucket being filled by four threads ("goroutines") concurrently. Unless things have changed markedly, it would be very difficult in Python to create a generator which produces its values with the help of some number of threads. In my opinion, the most counter-intuitive thing about that function is using the channel as a semaphore. Yet it seems to work perfectly, and this pattern seems to be recommended in Effective Go.
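The channel-as-semaphore pattern can be sketched roughly as follows, in present-day Go. This is not the repository's ChecksumIterator: the Sum type and Checksums function are hypothetical names, and plain strings stand in for file contents so the example stays self-contained.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"sync"
)

// Sum pairs an input with its SHA-1 digest (hypothetical type).
type Sum struct {
	Input  string
	Digest string
}

// Checksums hashes each input in its own goroutine. A buffered channel
// acts as a counting semaphore so that at most `workers` run at once.
// Results arrive in completion order, not input order.
func Checksums(inputs []string, workers int) <-chan Sum {
	out := make(chan Sum)
	sem := make(chan struct{}, workers)
	var wg sync.WaitGroup

	go func() {
		for _, in := range inputs {
			sem <- struct{}{} // acquire a slot; blocks while all slots are taken
			wg.Add(1)
			go func(in string) {
				defer wg.Done()
				defer func() { <-sem }() // release the slot
				digest := sha1.Sum([]byte(in))
				out <- Sum{in, hex.EncodeToString(digest[:])}
			}(in)
		}
		wg.Wait() // every worker has delivered its result
		close(out)
	}()
	return out
}

func main() {
	for s := range Checksums([]string{"a", "b", "c", "d", "e"}, 4) {
		fmt.Println(s.Digest, s.Input)
	}
}
```

The caller sees only a channel to range over; the number of goroutines behind it is an implementation detail.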

Syntactically, once you have read the grammar and the rationale, it is much more regular than other languages of the C tradition. The death knell has officially sounded for languages that lack closures; even the die-hards in the Unix/Plan 9 camp seem to have found situations in which closures are useful enough to be included.

For me, there are two mysteries left in this language that I'm aware of. The first is the relationship between slices and arrays. It seems like slices are present to be a kind of safe pointer complement to the fixed-size array notion. I am still foggy on how to use them effectively as a storage mechanism. I suspect I could have used a []string instead of a vector.StringVector and it may have been better, but the documentation is a little ambiguous as to whether slices that aren't really attached to any particular array can resize themselves when they exceed their capacity. It didn't seem to work the obvious way when I was playing with it.
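For readers with the same question today, here is a small sketch of how slices relate to arrays in current Go. It relies on the built-in append, which this README does not mention and which may not have been available in the version described here; the collect helper is a hypothetical name.

```go
package main

import "fmt"

// collect builds a slice from nothing: a nil slice has no array behind it
// until append allocates one, and append reallocates as the slice outgrows it.
func collect(items ...string) []string {
	var out []string
	for _, it := range items {
		out = append(out, it)
	}
	return out
}

func main() {
	backing := [4]string{"a", "b", "c", "d"} // fixed-size array
	s := backing[:2]                          // slice viewing its first two elements
	fmt.Println(len(s), cap(s))               // 2 4

	s = append(s, "X")      // fits within capacity: writes into backing[2]
	fmt.Println(backing[2]) // X

	s = append(s, "Y", "Z") // exceeds capacity: append copies to a new, larger array
	s[0] = "changed"
	fmt.Println(backing[0], s[0]) // a changed

	fmt.Println(collect("one", "two", "three")) // [one two three]
}
```

A slice never resizes in place; append returns a slice value that may or may not share storage with the old one, which is why its result has to be assigned back.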

The other remaining question for me has to do with the type system. I tried to make FindDuplicates(p Path) work using exp/iterable.Inject but the type conversions involved confused me too much. It would have been nice to replace that function with a one-liner to the effect of: iterable.Inject(ChecksumIterator(p), make(map[string] *vector.StringVector), iterable.Injector(combine)).
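As a rough illustration of the one-liner wished for above, here is a fold over a channel in present-day Go, which has type parameters. The names Inject, Pair, results and combine are hypothetical and do not reproduce the exp/iterable API; a map of string slices stands in for the map of vectors.

```go
package main

import "fmt"

// Pair is a hypothetical stand-in for one (path, checksum) result.
type Pair struct{ Path, Sum string }

// Inject folds every value received from seq into an accumulator.
func Inject[T, A any](seq <-chan T, init A, f func(A, T) A) A {
	acc := init
	for v := range seq {
		acc = f(acc, v)
	}
	return acc
}

// combine files each path under its checksum.
func combine(groups map[string][]string, p Pair) map[string][]string {
	groups[p.Sum] = append(groups[p.Sum], p.Path)
	return groups
}

// results produces a few made-up pairs, standing in for a checksum iterator.
func results() <-chan Pair {
	out := make(chan Pair)
	go func() {
		defer close(out)
		out <- Pair{"a.txt", "111"}
		out <- Pair{"b.txt", "222"}
		out <- Pair{"copy-of-a.txt", "111"}
	}()
	return out
}

func main() {
	groups := Inject(results(), map[string][]string{}, combine)
	fmt.Println(groups["111"]) // [a.txt copy-of-a.txt]
}
```

Any checksum whose group holds more than one path identifies a set of duplicates.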

There's no exception handling mechanism here, but I can't really say I would miss it when I see code such as this:

#!go

func getHash(path Path) (result string, err os.Error) {
	// make a new hash calculator
	hash := sha1.New();

	// declare file up front so the assignments below can use = rather
	// than :=, which would shadow the named return value err
	var file *os.File;

	// if we can open the file...
	if file, err = os.Open(string(path), os.O_RDONLY, 0); err == nil {
		// make sure the file is closed on the way out
		defer file.Close();
		// and if we can copy its contents to the hash
		if _, err = io.Copy(hash, file); err == nil {
			// then we have a result
			result = encodeBase64(hash.Sum())
		}
	}
	// return whichever of result and err was set
	return
}

I think this function really shows off the way two beautiful aspects of Go can work together to achieve a harmonious result. In this case, named return values combine with multiple return values to produce a function that either does the correct thing if all goes well or passes the error directly back to the caller to handle if it doesn't. To me, this has the benefit of Java's checked exceptions, in that you know an error may occur, but dealing with it is much simpler than in Java. You can simply ignore the error value if you're sure it won't be a problem, or you can use the if foo, err := ...; err == nil {...} pattern.

Another subtle detail hidden in this chunk of code is that the Hash interface embeds the Writer interface, which means shuffling the data between the file and the hash there (the io.Copy) is actually independent of both the specific hash and the specific kind of data source I'm hashing. You can't really achieve this kind of reuse in Java or most other languages (Haskell is a notable exception) because interface implementors must declare that they implement a particular interface. Go is much more future-proof, because interface implementors don't specify that they are implementing a particular interface. If two types implement a method with the same type signature, an interface can be made with that signature and functions can be written abstracting over these two types. In other words, code can be applied abstractly and retroactively without modification, automatically. This is very powerful!

The pointer situation is subtly but surprisingly different from that in C et al. In Go, pointers are almost a way of separating destructive code from functional code. A method can modify its "self" value only if it is a method on the pointer to the type. Go transparently dereferences pointers during method resolution, so the net effect is that pointers to values get all the usual, dare I say functional, methods as well as the destructive methods, whereas values just get the functional methods. With less syntactic burden, you get a step closer to FP for free while eliminating a huge source of C++ complexity and woe. This is the most insightful way of handling pointers and objects I've seen yet, and I have no doubt it will have excellent ramifications for code reliability going forward.
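The distinction between value and pointer methods can be seen in a few lines of present-day Go. Counter, Plus and Add are hypothetical names chosen for this sketch.

```go
package main

import "fmt"

// Counter is a hypothetical type used only for this illustration.
type Counter struct{ N int }

// Plus has a value receiver: it works on a copy and cannot change the
// caller's Counter, so it behaves like a pure function.
func (c Counter) Plus(d int) Counter {
	c.N += d
	return c
}

// Add has a pointer receiver: it modifies the Counter the caller holds.
func (c *Counter) Add(d int) {
	c.N += d
}

func main() {
	v := Counter{1}
	p := &v

	fmt.Println(v.Plus(10).N, v.N) // 11 1
	p.Add(5)                       // destructive: v is now {6}
	fmt.Println(p.Plus(10).N, v.N) // 16 6 (p is dereferenced automatically)
}
```

In current Go the compiler will also take the address of an addressable variable for you, so v.Add(5) compiles; the separation is strict mainly for values stored in interfaces, whose method set excludes pointer-receiver methods.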

Making distinctions between modifiable and immutable instances of the same objects turns out to be really handy in Go, thanks to channels. To avoid contention and deadlock, it's important to be able to send values over channels instead of pointers to objects. Go makes this less scary than it is in languages like Java because there really aren't objects per se so much as type aliases to structs and other things. Features that make OOP complex and unwieldy for these purposes simply aren't there. There is no type hierarchy to trip over, so nobody has to ask whether a type is abstract or the parent of another type. There also aren't class-level variables and methods to worry about synchronizing. In Go you really just have values and pointers to values, and these values have a single type, which may be assignable to another type.
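A brief sketch, in present-day Go, of why sending values rather than pointers avoids shared state. Job and double are hypothetical names for illustration.

```go
package main

import "fmt"

// Job is a hypothetical unit of work; it travels over channels by value.
type Job struct {
	Path string
	Size int
}

// double receives copies, changes them, and sends the copies on.
func double(in <-chan Job, out chan<- Job) {
	for j := range in {
		j.Size *= 2 // modifies only this goroutine's copy
		out <- j
	}
	close(out)
}

func main() {
	in, out := make(chan Job), make(chan Job)
	go double(in, out)

	original := Job{"a.txt", 21}
	in <- original // the struct is copied into the channel
	close(in)

	result := <-out
	fmt.Println(original.Size, result.Size) // 21 42
}
```

Had the channels carried *Job instead, both goroutines would hold the same struct and the sender would need some other way to know when it was safe to touch it again.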

Overall

Have you read the C++ FQA recently? It opens with a really excellent lecture on C++ and how each one of its faults is exacerbated by the next fault.

The situation with Go is quite the opposite. The differences are subtle, but they add up to a very large difference, because of how they interact. Years of learning to write types the C way make it hard to break the habit, but the consistency within Go makes the problem much easier. You always write the variable name, then the type, then the value. In some situations you can omit the type or the value but the overall order remains the same. This is true whether you're declaring a variable, a constant, a parameter, or whatever. Go never makes you write the same type twice on a line. Go inverts some bad defaults inherited from C. There's a fallthrough keyword that specifies you want this switch case to continue through the next one, because you want that feature about once for every ten times you don't want it.
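Both points, the fixed name-type-value order and the opt-in fallthrough, fit in a short present-day Go sketch. The identifiers limit, label, count and classify are made up for the example.

```go
package main

import "fmt"

// Name first, then type, then value; the type or the value may be dropped.
const limit int = 100 // name, type, value
var label = "size"    // type omitted, inferred from the value
var count int         // value omitted, starts at zero

// classify shows that a case ends on its own unless fallthrough is written.
func classify(n int) []string {
	var tags []string
	switch {
	case n > limit:
		tags = append(tags, "huge")
		fallthrough // explicitly continue into the next case body
	case n > 10:
		tags = append(tags, "big")
	case n > 0:
		tags = append(tags, "positive") // never reached from the case above
	}
	return tags
}

func main() {
	count = 500
	fmt.Println(label, classify(count)) // size [huge big]
	fmt.Println(label, classify(50))    // size [big]
	fmt.Println(label, classify(5))     // size [positive]
}
```

Note that fallthrough jumps into the next case body without testing that case's condition.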

Go includes a fair amount of sugar, but it's not the kind of sugar I'm used to from a Ruby background. None of the sugar in Go is ambiguous. You can put a single statement between your 'if' keyword and your condition, simply for readability and because in C one would often put an assignment inside an if statement. Because assignment isn't an expression, a whole class of C errors is gone, yet the syntax remains terse and readable.

The capitalization at first bothered me, but now I believe the reason for it is to enable a limited kind of the sort of encapsulation we've grown accustomed to since C++, but without the syntactic mess that goes with it in the form of private, protected, public etc.

Plan 9 from Google Labs

It's interesting to me to compare this to Plan 9 and Unix as well as to Limbo and (in a very limited way) Alef. Many features here in Go are present in Plan 9's C or Plan 9 itself. For example, Plan 9 uses the same convention for naming compiler binaries: an architecture character followed by a compiler stage character ('6g', '8l', etc.). Go also comes with Plan 9-style Makefiles, ported to vanilla make. Examine the Makefile for this project:

include $(GOROOT)/src/Make.$(GOARCH)

TARG=dupfinder
GOFILES=\
        dupfinder.go\
        pathiterator.go\
        checksum.go\
        main.go

include $(GOROOT)/src/Make.cmd

Go is certainly more open than Plan 9 or anything else Bell Labs ever did. Plan 9, for example, has a fan mailing list, but there is no official mailing list sanctioned by the labs, and never has been. If you asked for a feature in Plan 9, you were usually told to go _ yourself, the blank usually being 'write it', though many other suggestions were also common. XML is the hated enemy; if you don't like the colors, edit the source code; Emacs is the devil; "what other platforms?"; and 9P is the only protocol anyone will ever need.

Everything seems to be different with Go in these respects. There's a thriving mailing list, an XML library built-in, it runs on Mac OS X and Linux with ports underway for Haiku, FreeBSD and Windows, it comes with syntax highlighting for Emacs, Vi and Xcode, and there's no 9P library in sight.

On the one hand, it's nice to see some fresh air and a willingness to consider other people's ideas, but I do wonder if this says Plan 9 is feeling deader than usual. The 9fans mailing list seems to be eating this up but I see a lot of the usual suspects on the Go mailing list as well. Maybe it's just a cutting edge thing.

Weaknesses

From where I'm sitting, this is a fabulous foundation. Obviously there is a lot of work to be done in the coming years. I don't see any glaring omissions, though I wonder what will be done about generics, or if they will find a way around that problem akin to some of their other elegant solutions.

The big barriers for most people today are that it doesn't yet have integration with any other language, there is very poor editor support, no REPL, no debugger and no profiler. I see evidence that all of these things are coming along but they're not there yet. There are still baseline performance issues to be ironed out, but the performance today is better than lots of other things.

The type system is novel but I did find it occasionally to be a little counter-intuitive, particularly where anonymous functions are concerned. It took me several tries before I could come up with this classic closure demo:

#!go

package main

import "fmt"

func adder(i int) (func (int) int) {
	return func(x int) int { return i + x }
}

func main() {
	inc2 := adder(2);
	fmt.Printf("2 + 2 = %d\n", inc2(2));
	fmt.Printf("7 + 3 = %d\n", adder(7)(3));
}

This is, to my knowledge, the first compiled C-family language that can do this out of the box.

In Conclusion

For most of my friends, the next question is "should I be learning this?" Last week I said no. This week I'm saying, for most of you, not yet.

You should be learning this language if you're interested in languages. It's a case study in synergy. The whole system works together very well. I see more unintended benefits than side effects. Using it and learning it surprised me in a good way. I hope to use it for some kinds of tasks for which I previously relied on C or a compiled functional language.

You should also be learning this language if you're interested in parallelism. The model here isn't much different from Erlang's, except that it trades fault tolerance for performance and static type checking.

I think if you're planning on writing certain kinds of server software, you might put this language on your to-learn list. A stated goal of this project is to use it in Google's data centers as a protocol buffers server. If it's good for them and you have similar needs, it might behoove you to spend some time on it. Perhaps Google will relent a little on their internal language fascism.

At the moment this is a language in need of a killer app. I don't think fast compilation alone is a particularly valuable selling point (though coding in Java by day, I'm starting to think it might not be a bad idea), especially since there aren't any truly huge Go codebases at the moment and Go lacks things like shared libraries that could facilitate such things. Protocol buffers servers may be a large enough niche that Russ et al. can eke out an existence for this language, but the rest of the world will probably need more reason.

In its current state, Go is probably useless for systems software unless you're happy to go through system calls and eschew whatever libraries you are accustomed to using. I suspect this will be rectified soon.

I hope you will take a look at it. It is a surprising language.