DotNet First Side: Supporting LINQ in C# 3.0

Generics, delegates, and anonymous methods in C# 2.0 provide insight into how LINQ is supported in C# 3.0. Let's consider once again the problem of finding all doctors living within the Chicago city limits. As shown earlier, here's the obvious approach using foreach:

List<Doctor> inchicago = new List<Doctor>();

foreach (Doctor d in doctors)

if (d.City == "Chicago")

inchicago.Add(d);

A more elegant approach currently available in C# 2.0 is to take advantage of the collection's FindAll method, which can be passed a delegate to a function that determines whether the doctor resides in Chicago:

public delegate bool Predicate<T>(T obj); // pre-defined in .NET 2.0

public bool IsInChicago(Doctor obj) // notice how this matches delegate signature

{

return obj.City == "Chicago";

}

List<Doctor> inchicago = doctors.FindAll(new Predicate<Doctor>(this.IsInChicago));

FindAll iterates through the collection, building a new collection containing those objects for which the delegate-invoked function returns true.

A more succinct version passes an anonymous method to FindAll:

List<Doctor> inchicago = doctors.FindAll(delegate(Doctor d)

{

return d.City == "Chicago";

} );

The {} denote the body of the anonymous method; notice that these fall within the scope of the () in the call to FindAll. The signature of the anonymous method (in this case a Boolean function with a single Doctor argument) is type-checked by the compiler to ensure that it matches the definition of the argument to FindAll. The compiler then translates this version into an explicit delegate-based version, assigning a unique name to the underlying method:

private static bool b__0(Doctor d)

{

return d.City == "Chicago";

}

List<Doctor> inchicago = doctors.FindAll( new Predicate<Doctor>(b__0) );

While this has nothing to do with LINQ per se, this approach of translating from one abstraction to another exemplifies how LINQ integrates into C# 3.0.
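To make the equivalence concrete, here is a minimal sketch that runs the named-method and anonymous-method forms side by side. The Doctor class shown is a hypothetical stand-in with only the two fields needed here, and the sample data is invented for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the article's Doctor class.
public class Doctor
{
    public string Initials;
    public string City;
    public Doctor(string initials, string city) { Initials = initials; City = city; }
}

public static class FindAllDemo
{
    // Named method whose signature matches Predicate<Doctor>.
    private static bool IsInChicago(Doctor d) { return d.City == "Chicago"; }

    // Invented sample data: two doctors in Chicago, one elsewhere.
    public static List<Doctor> SampleDoctors()
    {
        List<Doctor> doctors = new List<Doctor>();
        doctors.Add(new Doctor("mbl", "Chicago"));
        doctors.Add(new Doctor("jl", "Evanston"));
        doctors.Add(new Doctor("ch", "Chicago"));
        return doctors;
    }

    // Explicit delegate wrapped around a named method.
    public static List<Doctor> WithNamedMethod(List<Doctor> doctors)
    {
        return doctors.FindAll(new Predicate<Doctor>(IsInChicago));
    }

    // Same filter written inline as an anonymous method.
    public static List<Doctor> WithAnonymousMethod(List<Doctor> doctors)
    {
        return doctors.FindAll(delegate(Doctor d) { return d.City == "Chicago"; });
    }

    public static void Main()
    {
        List<Doctor> doctors = SampleDoctors();
        Console.WriteLine(WithNamedMethod(doctors).Count);      // 2
        Console.WriteLine(WithAnonymousMethod(doctors).Count);  // 2
    }
}
```

Both calls return the same two doctors; the only difference is whether the developer or the compiler names the method behind the delegate.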

4.1. Lambda Expressions

From the developer's perspective, lambda expressions in C# 3.0 are a straightforward simplification of anonymous methods. Whereas an anonymous method is an unnamed block of code, a lambda expression is an unnamed expression that evaluates to a single value. Given a value x and an expression f(x) to evaluate, the corresponding lambda expression is written

x => f(x)

For example, in C# 3.0 the previous call to FindAll with an anonymous method is significantly shortened by using a lambda expression:

List<Doctor> inchicago = doctors.FindAll(x => x.City == "Chicago");

In this case, x is a doctor, and f(x) is the expression x.City == "Chicago", which evaluates to true or false. For readability, let's substitute d for x:

List<Doctor> inchicago = doctors.FindAll(d => d.City == "Chicago");

Notice that this lambda expression matches the parameter and body of the anonymous method we saw earlier, minus the syntactic extras like {}.

Lambda expressions are equivalent to anonymous methods that return a value when invoked, and are thus interchangeable. In fact, the compiler translates lambda expressions into delegate-based code, exactly as it does for anonymous methods.
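As a small illustration of that interchangeability, the sketch below builds the same Predicate<string> twice, once from an anonymous method and once from a lambda expression, and passes each to FindAll. The class and method names are hypothetical, and plain city strings stand in for Doctor objects to keep the example short.

```csharp
using System;
using System.Collections.Generic;

public static class LambdaDemo
{
    // C# 2.0 style: anonymous method with an explicit parameter type.
    public static Predicate<string> AsAnonymousMethod()
    {
        return delegate(string city) { return city == "Chicago"; };
    }

    // C# 3.0 style: lambda expression; the parameter type is inferred
    // from the delegate type Predicate<string>.
    public static Predicate<string> AsLambda()
    {
        return city => city == "Chicago";
    }

    public static void Main()
    {
        List<string> cities = new List<string> { "Chicago", "Evanston", "Chicago" };
        Console.WriteLine(cities.FindAll(AsAnonymousMethod()).Count);  // 2
        Console.WriteLine(cities.FindAll(AsLambda()).Count);           // 2
    }
}
```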

4.2. Type Inference

Interestingly, there is one very subtle difference between lambda expressions and anonymous methods: the latter require type information, while the former do not. Returning to our example, notice that the lambda expression

d => d.City == "Chicago"

does not specify a type for d. Without a type, the compiler cannot translate the lambda expression into the equivalent anonymous method:

delegate(?????? d) // what type is the argument d?

{

return d.City == "Chicago";

}

To make this work, C# 3.0 is actually inferring the type of d in the lambda expression, based on contextual information. For example, since doctors is of type List<Doctor>, the compiler can prove that in the context of calling FindAll

doctors.FindAll(d => d.City == "Chicago")

d must be of type Doctor.

Type inference is used throughout C# 3.0 to make LINQ more convenient, without any loss of safety or performance. The idea of type inference is that the compiler infers the types of your variables based on context, not programmer-supplied declarations. The compiler does this while continuing to enforce strict, compile-time type checking.

As we saw earlier when introducing LINQ, developers can take advantage of type inference by using the var keyword when declaring local variables. For example, the declarations

var sum = 0;

var avg = 0.0;

var obj = new Doctor(...);

trigger the inference of int for sum, double for avg, and Doctor for obj. The initializer in the declaration is used to drive the inference engine, and is required. The following are thus illegal:

var obj2; // ERROR: must have an initializer

var obj3 = null; // ERROR: must have a specific type

Assuming the inference is successful, the inferred type becomes permanent and compilation proceeds as usual. Apparent misuses of the type are detected and reported by the compiler:

var obj4 = "hi!";

obj4.Close(); // Oops, wrong object! (ERROR: 'string' does not contain 'Close').

Do not confuse var with the concept of a VB variant (it's not), nor with the concept of var in dynamic languages like JavaScript (where var really means object). In these languages the variable's type can change, and so type checking is performed at runtime: increased flexibility at the cost of safety. In C# 3.0 the type cannot change, and all type checking is done at compile-time. For example, if the inferred type is object (as for obj6 below), in C# 3.0 you end up with an object reference of very little functionality:

object obj5 = "hi!"; // obj5 references the string "hi!", but type is object

var obj6 = obj5; // obj6 also references "hi!", with inferred type object

string s1 = obj6.ToUpper(); // ERROR: 'object' does not contain 'ToUpper'
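One way to observe the inferred compile-time type, as opposed to the runtime type of the referenced object, is a small generic helper. In this sketch StaticTypeOf is a hypothetical helper (not part of .NET) that reports the type argument the compiler infers from its argument's declared type.

```csharp
using System;

public static class VarDemo
{
    // Hypothetical helper: T is inferred from the argument's compile-time type,
    // so typeof(T) reveals what the compiler decided, not what the object is.
    public static Type StaticTypeOf<T>(T value) { return typeof(T); }

    public static void Main()
    {
        var sum = 0;
        var avg = 0.0;
        object obj5 = "hi!";
        var obj6 = obj5;

        Console.WriteLine(StaticTypeOf(sum));   // System.Int32
        Console.WriteLine(StaticTypeOf(avg));   // System.Double
        Console.WriteLine(StaticTypeOf(obj6));  // System.Object (inferred type)
        Console.WriteLine(obj6.GetType());      // System.String (runtime type)
    }
}
```

The last two lines show why obj6.ToUpper() is rejected: the compiler works only with the inferred type object, even though the referenced object is a string.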

While developers will find type inference useful in isolation, the real motivation is LINQ. Type inference is critical to the success of LINQ since queries can yield complex results. LINQ would be far less attractive if developers had to explicitly type all aspects of their queries. In fact, specifying a type is sometimes impossible, e.g., when projections select new patterns of data:

var query = from d in doctors

where d.City == "Chicago"

select new { d.GivenFirstName, d.FamilyLastName }; // type?

foreach (var r in query)

System.Console.WriteLine("{0}, {1}", r.FamilyLastName, r.GivenFirstName);

In these cases, typing is better left to the compiler.

4.3. Anonymous Types and Object Initializers

Consider the previous query to find the names of all doctors living in Chicago:

var query = from d in doctors

where d.City == "Chicago"

select new { d.GivenFirstName, d.FamilyLastName };

The query interacts with complete Doctor objects from the doctors collection, but then projects only the doctor's first and last name. Look carefully at the syntax. We see the new keyword without a class name, and { } instead of ( ):

new { d.GivenFirstName, d.FamilyLastName }

The new operator is in fact instantiating an object, but the class is an anonymous type: the C# 3.0 compiler is automatically creating an appropriate class definition with a unique name and substituting that name. The only constructor provided by the generated class is a default one (i.e., parameterless), which favors serialization but complicates the issue of object initialization. This is the motivation for the { }, which denote C# 3.0's new object initializer syntax. Object initializers provide an inline mechanism for initializing objects. And in the case of anonymous types, object initializers define both the data members of the type and their initial values.

For example, our earlier object instantiation is translated to

new <Projection>f__12 { d.GivenFirstName, d.FamilyLastName }

where the compiler has generated

public sealed class <Projection>f__12

{

private string _GivenFirstName;

private string _FamilyLastName;

public string GivenFirstName

{ get { return _GivenFirstName; } set { _GivenFirstName = value; } }

public string FamilyLastName

{ get { return _FamilyLastName; } set { _FamilyLastName = value; } }

public <Projection>f__12( ) // default constructor: empty

{ }

public override string ToString() // dumps contents of all fields

{ ... }

public override bool Equals(object obj) // all fields must be Equals

{ ... }

public override int GetHashCode()

{ ... }

}

As you can see, the names and types of the properties are inferred from the initializers. If you prefer, you can assign your own names to the data members of the anonymous type. For example:

new { First = d.GivenFirstName, Last = d.FamilyLastName }

In this case the compiler generates:

public sealed class <Projection>f__13

{

private string _First;

private string _Last;

public string First

{ get { return _First; } set { _First = value; } }

public string Last

{ get { return _Last; } set { _Last = value; } }

}

Anonymous types are being added to C# 3.0 primarily in support of LINQ, allowing developers to work with just the data they need in a familiar, object-oriented package. And courtesy of type inference, you can work with anonymous types in a safe, type-checked manner.

A current limitation of anonymous types is that they cannot be accessed outside the defining assembly: what name would you use to refer to the class? In theory you could compile the code in a separate assembly and then use the generated name, but this is obviously bad practice (and currently prevented by the fact that the generated name is invalid at the source code level). At issue is the design of N-tier applications, whose tiers communicate by passing data. If the data is packaged as an anonymous type, how do the other tiers refer to it? Since they cannot, anonymous types should be viewed as a technology for local use only.
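When projected data does need to leave the local scope, one option is to project into a small named class using the same object initializer syntax. The sketch below assumes the released form of LINQ, where the standard query operators are imported from System.Linq rather than the preview-era System.Query; Doctor and DoctorName are hypothetical types defined only for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the article's Doctor class.
public class Doctor
{
    public string GivenFirstName;
    public string FamilyLastName;
    public string City;
}

// Hypothetical named type for data that must cross a tier boundary.
public class DoctorName
{
    public string First { get; set; }
    public string Last { get; set; }
}

public static class AnonymousTypeDemo
{
    // Projects into a named type, so callers can declare the result type.
    public static List<DoctorName> ChicagoNames(IEnumerable<Doctor> doctors)
    {
        var query = from d in doctors
                    where d.City == "Chicago"
                    select new DoctorName { First = d.GivenFirstName, Last = d.FamilyLastName };
        return query.ToList();
    }

    // Two anonymous objects with the same member names and types share one
    // generated class, and its Equals compares member by member.
    public static bool AnonymousObjectsCompareByValue()
    {
        var a = new { First = "Ann", Last = "Lee" };
        var b = new { First = "Ann", Last = "Lee" };
        return a.GetType() == b.GetType() && a.Equals(b);
    }

    public static void Main()
    {
        var doctors = new List<Doctor>
        {
            new Doctor { GivenFirstName = "Ann", FamilyLastName = "Lee", City = "Chicago" },
            new Doctor { GivenFirstName = "Bob", FamilyLastName = "Ray", City = "Evanston" }
        };
        foreach (DoctorName n in ChicagoNames(doctors))
            Console.WriteLine("{0}, {1}", n.Last, n.First);       // Lee, Ann
        Console.WriteLine(AnonymousObjectsCompareByValue());      // True
    }
}
```

The anonymous type stays a local convenience, while the named DoctorName gives other tiers something they can refer to.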

4.4. Query Expressions

A LINQ query is called a query expression. Query expressions start with the keyword from, and are written using SQL-like query operators such as Select, Where, and OrderBy:

using System.Query; // import standard LINQ query operators

// all doctors living in Chicago, sorted by last name, first name:

var chicago = from d in doctors

where d.City == "Chicago"

orderby d.FamilyLastName, d.GivenFirstName

select d;

foreach (var r in chicago)

System.Console.WriteLine("{0}, {1}", r.FamilyLastName, r.GivenFirstName);

By default, you must import the namespace System.Query to gain access to the standard LINQ query operators.

As noted earlier, type inference is used to make query expressions easier to write and consume:

var chicago = from d in doctors ... select d;

foreach (var r in chicago) ... ;

But what exactly is a query expression? What type is inferred for the variable chicago above? Consider SQL. In SQL, a select query is a declarative statement that operates on one or more tables, producing a table. In LINQ, a query expression is a declarative expression operating on one or more IEnumerable objects, returning an IEnumerable object. Thus, a query expression is an expression of iteration across one or more objects, producing an object over which you iterate to collect the result. For example, let's be type-specific in our previous declaration:

IEnumerable<Doctor> chicago = from d in doctors ... select d;

LINQ defines query expressions in terms of IEnumerable<T> to hide implementation details (preserving flexibility!) while conveying in a strongly-typed way the key concept that a query expression can be iterated across:

foreach (Doctor d in chicago) ... ;

Of course, type inference conveniently hides this level of detail without any loss of safety or performance.

4.5. Extension Methods

So how exactly are query expressions executed? Query expressions are translated into traditional object-oriented method calls by way of extension methods. Extension methods, new in C# 3.0, extend a class without actually being members of that class. For example, consider the following query expression:

using System.Query;

// initials of all doctors living in Chicago, in no particular order:

var chicago = from d in doctors

where d.City == "Chicago"

select d.Initials;

Using extension methods, this query can be rewritten as follows:

using System.Query;

var chicago = doctors.

Where(d => d.City == "Chicago").

Select(d => d.Initials);

This statement calls two methods, Where and Select, passing them lambda expressions, yet these methods are not members of the Doctors class. How and why does this compile?

In C# 3.0, extension methods are defined as static methods in a static class, annotated with the System.Runtime.CompilerServices.Extension attribute. By importing the namespace that includes this static class, the compiler treats the extension methods as if they were instance methods. For example, the LINQ standard query operators are defined as extension methods in the class System.Query.Sequence. By importing the namespace System.Query, we gain access to the query operators as if they were members of the Doctors class.

Let's look at an extension method in more detail. Here's the signature for System.Query.Sequence.Where:

namespace System.Query

{

public delegate ReturnT Func<ArgT, ReturnT>(ArgT arg);

public static class Sequence

{

public static IEnumerable<T> Where<T>(this IEnumerable<T> source,

Func<T, bool> predicate)

{ ... }

}

}

The extension method Where returns an IEnumerable object by iterating across an existing IEnumerable object (source), applying a delegate-based Boolean function (predicate) to determine membership in the result set. Notice that both the first parameter and the return value are defined in terms of IEnumerable<T>, establishing the link between query expressions and the extension methods that drive them.

Given a query expression or a query written using extension methods, the C# 3.0 compiler translates these into calls to the underlying static methods. For example, either form of our query above is conceptually translated into the following:

var temp = System.Query.Sequence.Where(doctors, d => d.City == "Chicago");

var chicago = System.Query.Sequence.Select(temp, d => d.Initials);

In reality, this form is skipped and the translation proceeds directly to C# 2.0-compatible code:

IEnumerable<Doctor> temp; // doctors living in chicago

IEnumerable<string> chicago; // initials of doctors living in chicago

temp = System.Query.Sequence.Where<Doctor>( doctors,

new Func<Doctor, bool>(b__0) );

chicago = System.Query.Sequence.Select<Doctor, string>( temp,

new Func<Doctor, string>(b__3) );

with the lambda expressions translated into delegate-invoked methods:

private static bool b__0(Doctor d) // lambda: d => d.City == "Chicago"

{ return d.City == "Chicago"; }

private static string b__3(Doctor d) // lambda: d => d.Initials

{ return d.Initials; }
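The three forms discussed above can be compared directly. This sketch assumes the released framework, in which the standard query operators live in the static class System.Linq.Enumerable rather than the preview-era System.Query.Sequence; the Doctor class and sample data are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the article's Doctor class.
public class Doctor
{
    public string Initials;
    public string City;
}

public static class TranslationDemo
{
    // Form 1: query expression.
    public static IEnumerable<string> QuerySyntax(IEnumerable<Doctor> doctors)
    {
        return from d in doctors
               where d.City == "Chicago"
               select d.Initials;
    }

    // Form 2: extension methods called with instance syntax.
    public static IEnumerable<string> MethodSyntax(IEnumerable<Doctor> doctors)
    {
        return doctors.Where(d => d.City == "Chicago").Select(d => d.Initials);
    }

    // Form 3: the same operators called as ordinary static methods.
    public static IEnumerable<string> StaticCalls(IEnumerable<Doctor> doctors)
    {
        IEnumerable<Doctor> temp = Enumerable.Where(doctors, d => d.City == "Chicago");
        return Enumerable.Select(temp, d => d.Initials);
    }

    public static List<Doctor> SampleDoctors()
    {
        return new List<Doctor>
        {
            new Doctor { Initials = "mbl", City = "Chicago" },
            new Doctor { Initials = "jl", City = "Evanston" },
            new Doctor { Initials = "ch", City = "Chicago" }
        };
    }

    public static void Main()
    {
        List<Doctor> doctors = SampleDoctors();
        Console.WriteLine(string.Join(",", QuerySyntax(doctors).ToArray()));   // mbl,ch
        Console.WriteLine(string.Join(",", MethodSyntax(doctors).ToArray()));  // mbl,ch
        Console.WriteLine(string.Join(",", StaticCalls(doctors).ToArray()));   // mbl,ch
    }
}
```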

Notice the important role that type inference plays during this translation. Extension methods and their supporting types (IEnumerable, Func, etc.) are elegantly defined once using generics. Type inference is relied on to determine the type T involved in each aspect of the query, and to then qualify the generic appropriately. In this case we see the inference of both Doctor and string.

In C# 3.0, what differentiates an extension method from an ordinary static method? Observe that a query based on extension methods is converted to a static version by passing the object instance as the first argument:

doctors.Where(...) ==> System.Query.Sequence.Where(doctors, ...)

This is identical to the standard mechanism for calling instance methods, where the first parameter is a reference to the object itself, providing a value for this. This logic explains C#'s choice of the this keyword in the signature of extension methods:

public static IEnumerable<T> Where<T>(this IEnumerable<T> source, ... )

In C# 3.0 it is the presence of the this keyword on the first parameter that identifies a static method as an extension method.
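Writing your own extension method follows the same pattern. The following is a minimal sketch: InCity and DoctorExtensions are hypothetical names, and Doctor is a stand-in class defined only for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the article's Doctor class.
public class Doctor
{
    public string Initials;
    public string City;
}

public static class DoctorExtensions
{
    // The 'this' modifier on the first parameter marks InCity as an
    // extension method on IEnumerable<Doctor>.
    public static IEnumerable<Doctor> InCity(this IEnumerable<Doctor> source, string city)
    {
        foreach (Doctor d in source)
            if (d.City == city)
                yield return d;
    }
}

public static class ExtensionDemo
{
    // Instance-style call: the compiler rewrites it into the static call below.
    public static int CountInstanceStyle(List<Doctor> doctors, string city)
    {
        int n = 0;
        foreach (Doctor d in doctors.InCity(city)) n++;
        return n;
    }

    // Ordinary static call with the collection passed as the first argument.
    public static int CountStaticStyle(List<Doctor> doctors, string city)
    {
        int n = 0;
        foreach (Doctor d in DoctorExtensions.InCity(doctors, city)) n++;
        return n;
    }

    public static void Main()
    {
        List<Doctor> doctors = new List<Doctor>
        {
            new Doctor { Initials = "mbl", City = "Chicago" },
            new Doctor { Initials = "jl", City = "Evanston" }
        };
        Console.WriteLine(CountInstanceStyle(doctors, "Chicago"));  // 1
        Console.WriteLine(CountStaticStyle(doctors, "Chicago"));    // 1
    }
}
```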

What if the class (such as Doctors) or one of its base classes contains methods that conflict with imported extension methods? To avoid unexpected behavior, all other methods take priority: extension methods have the lowest precedence when the compiler performs name resolution, and thus become candidates only after all other possibilities have been exhausted. If two or more imported namespaces yield candidate extension methods, the compiler reports the conflict as a compilation error.

4.6. Lazy Evaluation

Interestingly, you may be a bit surprised by the behavior of queries in LINQ. Let's look at an example. First, consider the following Boolean function that as a side-effect outputs the doctor's initials:

public static bool dump(Doctor d)

{

System.Console.WriteLine(d.Initials);

return true;

}

Now let's use dump in a simple query that will end up selecting all the doctors because the function always returns true:

var query = from d in doctors

where dump(d)

select d;

Here's the million-dollar question: what does this query output? [ Nothing! ]

It turns out that the declaration of a query expression does just that: it declares a query, but does not evaluate it. Evaluation in most cases is delayed until the results of the query are actually requested, an approach known as lazy evaluation. For example, the following statement outputs the initials of the first doctor:

query.GetEnumerator().MoveNext(); // outputs ==> "mbl"

By "moving" to the first element in the result set, we trigger a call to dump based on the first doctor, which outputs the doctor's initials. Since dump returns true, this doctor satisfies the where condition, the doctor is considered part of the result set, and MoveNext returns because it has successfully moved to the first element in the result set. Hence the initials of the first doctor, and only the first doctor, are output. [ What do you think happens if dump returns false for the first doctor? ]

To output the initials of all the doctors, we iterate across the entire result:

foreach (var result in query) // "mbl", "jl", "ch", ...

;

Again, it's sufficient to simply request each result in order to trigger evaluation (and in this case the output of the doctor's initials); we do not have to access the resulting value.

You are probably wondering how lazy evaluation is supported in C# 3.0. In fact, the support was introduced in C# 2.0 via the yield construct. For example, here's the complete implementation of the standard LINQ query operator Where that we discussed earlier:

public static class Sequence

{

public static IEnumerable<T> Where<T>(this IEnumerable<T> source,

Func<T, bool> predicate)

{

foreach (T element in source)

if (predicate(element))

yield return element;

}

}

The yield return pattern produces the next value of the iteration and then returns. In response to yield, the C# compiler generates the necessary code (essentially a nested iterator class) to implement IEnumerable and enable iteration to continue where it left off.
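The effect of yield on evaluation order can be measured by counting predicate calls. This is a sketch under stated assumptions: Filter is a hypothetical hand-written operator in the style of Where, and the Calls counter exists only to make the laziness visible.

```csharp
using System;
using System.Collections.Generic;

public static class LazyDemo
{
    // Number of times the predicate has run so far.
    public static int Calls;

    // Hypothetical hand-written filter in the style of Where.
    // Nothing in this body runs until a caller starts iterating.
    public static IEnumerable<T> Filter<T>(IEnumerable<T> source, Func<T, bool> predicate)
    {
        foreach (T element in source)
            if (predicate(element))
                yield return element;
    }

    // Returns the predicate call count observed after each step.
    public static int[] Run()
    {
        Calls = 0;
        List<int> numbers = new List<int> { 1, 2, 3 };
        IEnumerable<int> query = Filter(numbers, n => { Calls++; return true; });
        int afterDeclaration = Calls;    // query declared, nothing evaluated

        using (IEnumerator<int> e = query.GetEnumerator())
            e.MoveNext();                // evaluates only the first element
        int afterFirstMove = Calls;

        foreach (int n in query) { }     // a fresh pass over all three elements
        int afterFullPass = Calls;

        return new int[] { afterDeclaration, afterFirstMove, afterFullPass };
    }

    public static void Main()
    {
        int[] counts = Run();
        Console.WriteLine("{0} {1} {2}", counts[0], counts[1], counts[2]);  // 0 1 4
    }
}
```

The counts 0, 1, and 4 show that declaring the query costs nothing, a single MoveNext evaluates one element, and each new iteration starts again from the beginning of the source.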

NOTE

To learn more about how things work and what the C# compiler is doing, I highly recommend Lutz Roeder's Reflector tool as a way to reverse-engineer your compiled code and see what's going on: http://www.aisto.com/roeder/dotnet.

One of the implications of LINQ's lazy evaluation is that query expressions are re-evaluated each time they are processed. In other words, the results of a query are not stored or cached in a collection, but lazily evaluated and returned on each request. The advantage is that changes in the data are automatically reflected in the next processing cycle:

foreach (var result in query) // outputs initials for all doctors

;

doctors.Add( new Doctor("xyz", ...) );

foreach (var result in query) // output now includes new doctor "xyz"

;

Another advantage is that complex queries can be constructed from simpler ones, and execution is delayed until the final query is ready (and possibly optimized!).

If you want to force immediate query evaluation and cache the results, the simplest approach is to call the ToArray or ToList methods on the query. For example, the following code fragment outputs the initials of all doctors three consecutive times:

var cache1 = query.ToArray(); // evaluate query & cache results as an array

doctors.Add( new Doctor("xyz", ...) );

System.Console.WriteLine("###");

var cache2 = query.ToList(); // evaluate query again & cache results as a list

System.Console.WriteLine("###");

System.Console.WriteLine( cache1.GetType() );

System.Console.WriteLine( cache2.GetType() );

foreach (var result in cache1) // output results of initial query:

System.Console.WriteLine(result.Initials);

System.Console.WriteLine("###");

The second cache contains the new doctor, but the first does not. Here's the output:

mbl

###

mbl

xyz

###

Doctor[]

System.Collections.Generic.List`1[Doctor]

mbl

###
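The difference between a cached snapshot and a live query can also be checked by comparing counts. This sketch assumes the released System.Linq namespace and uses a list of invented initials in place of Doctor objects.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CacheDemo
{
    // Returns { size of the cached snapshot, size of the live query result }.
    public static int[] Run()
    {
        List<string> initials = new List<string> { "mbl", "jl" };
        IEnumerable<string> query = initials.Where(s => s.Length > 0);

        List<string> snapshot = query.ToList();  // evaluated now and stored
        initials.Add("xyz");                     // change the underlying data

        // The snapshot is unchanged; the query is re-evaluated by Count().
        return new int[] { snapshot.Count, query.Count() };
    }

    public static void Main()
    {
        int[] sizes = Run();
        Console.WriteLine("{0} {1}", sizes[0], sizes[1]);  // 2 3
    }
}
```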

Most (but not all) of the standard LINQ query operators are lazily evaluated. The exceptions are operators that return a single (scalar) value, such as Min and Max, which might as well produce the value instead of a collection containing that value:

int minPager = doctors.Min(d => d.PagerNumber);

int maxPager = doctors.Max(d => d.PagerNumber);

System.Console.WriteLine("Pager range: {0}..{1}", minPager, maxPager);

The section on standard LINQ query operators will make note of which operators are lazily evaluated, and which are not. But first, let's take a deeper look at what LINQ offers.

Wednesday, February 13, 2008
