Inside ADO.NET Batch Update
Dino Esposito
Wintellect
November 8, 2001
The interaction between ADO.NET applications and the underlying data sources is based on a dual architecture with two-way channels. You access a data source to read and write rows using either individual and provider-specific commands or batch update procedures. In both cases, the data access results in a complete two-way binding and involves different objects and methods. You use command classes like SqlCommand and OleDbCommand to execute single commands. You would use data adapter objects to download disconnected data and submit sets of updated rows. Individual commands return data through data reader objects, whereas the DataSet is the container object that the data adapter utilizes to return and submit blocks of records.
Updates accomplished through individual commands, stored procedures, and in general any command text the managed provider understands, are normally referred to as update. An update command always carries new data out embedded in the body of the statement. The update command always requires an open connection, and may also require an ongoing or a new transaction. The batch update is the offshoot of a slightly different approach. At the highest level of abstraction, you don't issue a command, no matter how complex it could be. Instead, you submit a snapshot of the current rows as modified on the client and wait for the data source approval. The key concept behind batch update is the concept of data disconnection. You download a table of rows, typically a DataSet, modify it as needed on the client, and then submit the new image of those rows to the database server. You submit changes rather than executing a command that will create changes to the data source. This is the essential difference between update, which I covered in my July column, and batch update.
The figure below illustrates the dual update architecture of ADO.NET.
Figure 1. The dual two-way interaction between ADO.NET applications and the data source
Before going any further with the details of ADO.NET batch update, I'd like to clarify one aspect of the batch update model that often leads to some misunderstanding. Although update and batch update are philosophically different, in terms of the actual implementation within ADO.NET, they follow the same update model. Both update and batch update are accomplished through direct and provider-specific statements. Of course, since batch update normally involves more rows, the statements are grouped into a batch call. Batch update loops through the rows of the target DataSet and issue the proper update command (INSERT, DELETE, or UPDATE) whenever an updated row is found. In correspondence of an updated row, a predefined direct SQL command is run. In essence, this is batch update.
This comes as no surprise. In fact, if batch update were using a completely different model of update, then a special support would have been required from the data source. (This is what happens when you submit XML updategrams to SQL Server 2000.) Batch update is just a client-provided software mechanism to simplify the submission of multiple row updates. In any case, each new row submission is always made through the normal channels of data source direct commands.
So far I've only hinted at SQL commands, but these hints are a sure sign of an important difference between the ADO and the ADO.NET batch update implementation. In ADO, batch update was only possible for SQL-based data sources. In ADO.NET, instead, batch update is possible for any kind of managed provider, including those that should not be exposing their data through the SQL query language. That said, it's about time I start reviewing the key aspects of ADO.NET batch update programming.
Preparing the DataSet for Submission
ADO.NET batch update takes place through the Update method of the data adapter object. Data can be submitted only on a per-table basis. If you call Update without specifying a table name, a default name of Table is assumed. If no table exists with that name, an exception is raised. Update first examines the RowState property of each table row and then prepares a tailor-made INSERT, UPDATE, or DELETE statement for each inserted, updated, or deleted row in the specified table.
The Update method has several overloads. It can take a pair given by the DataSet and the DataTable, a DataTable, or even an array of DataRow objects. The method returns an integer value being the number of rows successfully updated.
To minimize the network traffic, you normally invoke Update on a subset of the DataSet you are working on. Needless to say, this subset contains only the rows that have been modified in the meantime. You get such a subset by calling the DataSet's GetChanges method.
if (ds.HasChanges())
{
DataSet dsChanges = ds.GetChanges();
adapter.Update(dsChanges, "MyTable");
}
You check the DataSet for changes using the HasChanges method instead. HasChanges returns a Boolean value.
The DataSet returned by GetChanges contains the rows that have been inserted, deleted, or modified in the meantime. But in the meantime of what? This is a tricky aspect of ADO.NET batch update and has to do with the current state of a table row.
States of a Row
Each row in a DataTable is rendered through a DataRow object. A DataRow object mainly exists to be an element of the Rows collection of a parent DataTable object. Conceptually, a database row is inherently linked to the structure of a given table. Just for this case, the DataRow class in ADO.NET does not provide a public constructor. The only way to create a new DataRow object is by means of a method called NewRow, on a particular living instance of a DataTable object. Upon creation, a row does not yet belong to the Rows collection of the parent table, but its relationship with this collection determines the state of the row. The following table shows the feasible values for the RowState property. Those values are grouped in the DataRowState enumeration.
Row State |
Description |
Added |
The row has been added to the table. |
Deleted |
The row has been marked for deletion from the parent table. |
Detached |
Either the row has been created but not added to the table, or the row has been removed from the collection of table rows. |
Modified |
Some columns within the row have been changed. |
Unchanged |
No changes made to the row since creation or since the last call to the AcceptChanges method. |
The RowState property of each row influences the return value of the HasChanges method and the contents of the child DataSet returned by GetChanges.
From the range of the feasible values, it turns out that the value of RowState mostly depends on the operation that has been performed on the row. ADO.NET tables implement a transaction-like commit model based on two methods—AcceptChanges and RejectChanges. When the table is downloaded from the data source, or freshly created in memory, all the rows are unchanged. All the changes you enter are not immediately persistent and can be rolled back at any time by calling RejectChanges. You can call the RejectChanges method at three levels:
- On the DataSet to reject all changes whatsoever.
- On the DataTable to cancel all the changes in the particular table.
- On a particular row to restore its previous state.
The method AcceptChanges has the power to commit all the ongoing changes. It makes the DataSet accept the current values as the new original values. As a result, all the pending changes are cleared up. Just as RejectChanges, AcceptChanges can also be called on the whole DataSet, on a particular table, or an individual row.
When you start a batch update operation, only the rows marked as Added, Deleted, and Modified are taken into account for submission. If you happen to call AcceptChanges prior to batch-update, no change will be persisted to the data source.
On the other hand, once the batch update operation has successfully completed, you must call AcceptChanges to clear pending changes and mark the current DataSet values as the original values. Notice that omitting a final call to AcceptChanges would maintain pending changes in the DataSet with the result to have them re-issued next time you batch update.
// Get changes in the DataSet
dsChanges = ds.GetChanges();
// Performs the batch update for the given table
da.Update(dsChanges, strTable);
// Clears any pending change in memory
ds.AcceptChanges();
The code above illustrates the three main steps behind ADO.NET batch update.
If you delete a row from a DataSet table, pay attention to the method you use—Delete or Remove. The Delete method performs a logical deletion by marking the row as Deleted. The Remove method, instead, physically removes the row from the Rows collection. As a result, a row deleted through Remove is not marked for deletion and subsequently not processed during batch update. If the ultimate goal of your deletion is removing the row from the data source, then use Delete.
Update Internals
Three operations can modify the state of a table:
- Insertion of a new row
- Deletion of an existing row
- Update of an existing row
For each of these key operations, the data adapter defines a tailor-made command object that is exposed as a property. Such properties include InsertCommand, DeleteCommand, and UpdateCommand. The programmer is responsible to assign these properties meaningful command objects—for example, SqlCommand objects.
Just the availability of InsertCommand, DeleteCommand, and UpdateCommand properties represents a quantum leap from ADO. Such properties give you unprecedented control over the way in which in-memory updates are submitted to the database server. If you happen to dislike the update code that ADO.NET generates, you can now modify it without renouncing the overall feature of batch update. With ADO you had no control on the SQL commands silently generated by the library. In ADO.NET, instead, publicly exposed command objects allow you to apply updates using made-to-measure stored procedures or SQL statements that better match your user expectations. In particular, you can have the batch update system work with cross-referenced tables and even target non-SQL data providers like Active Directory™ or Indexing Services.
The update commands are expected to run for each changed row in the table and have to be general enough to accommodate different values. Command parameters would be good at this kind of task as long as you can you bind them to the values of a database column. ADO.NET parameter objects expose two properties, like SourceColumn and SourceVersion, which provide for this type of binding. SourceColumn, in particular, represents an indirect way to indicate the parameter's value. Instead of using the Value property and setting it with a scalar value, you could set the SourceColumn property with a column name and have the batch update mechanism to extract the effective value from time to time.
SourceVersion indicates which value should be read on the column. By default, ADO.NET returns the current value of the row. As an alternative, you could select the original value and all the values found in the DataRowVersion enumeration.
If you want to batch update a couple of columns on the Northwind's Employees table, you can use the following, handcrafted commands. The INSERT command is defined as follows:
StringBuilder sb = new StringBuilder("");
sb.Append("INSERT Employees (firstname, lastname) VALUES(");
sb.Append("@sFirstName, @sLastName)");
da.InsertCommand = new SqlCommand();
da.InsertCommand.CommandText = sb.ToString();
da.InsertCommand.Connection = conn;
All the parameters will be added to the data adapter's Parameters collection and bound to a DataTable column.
SqlParameter p1 = new SqlParameter("@sFirstName", SqlDbType.NVarChar, 10);
p1.SourceVersion = DataRowVersion.Current;
p1.SourceColumn = "firstname";
da.InsertCommand.Parameters.Add(p1);
SqlParameter p2 = new SqlParameter("@sLastName", SqlDbType.NVarChar, 30);
p2.SourceVersion = DataRowVersion.Current;
p2.SourceColumn = "lastname";
da.InsertCommand.Parameters.Add(p2);
Notice that auto-increment columns should not be listed in the syntax of the INSERT command as their value is being generated by the data source.
The UPDATE command needs to identify one particular row to apply its changes. You do this using a WHERE clause in which a parameterized value is compared against a key field. In this case, the parameter used in the WHERE clause must be bound to the original value of the row, instead of the current value.
StringBuilder sb = new StringBuilder("");
sb.Append("UPDATE Employees SET ");
sb.Append("lastname=@sLastName, firstname=@sFirstName ");
sb.Append("WHERE employeeid=@nEmpID");
da.UpdateCommand = new SqlCommand();
da.UpdateCommand.CommandText = sb.ToString();
da.UpdateCommand.Connection = conn;
// p1 and p2 set as before
:
p3 = new SqlParameter("@nEmpID", SqlDbType.Int);
p3.SourceVersion = DataRowVersion.Original;
p3.SourceColumn = "employeeid";
da.UpdateCommand.Parameters.Add(p3);
Finally, the DELETE command requires a WHERE clause to identify the row to remove. In this case, you need to use the original version of the row to bind the parameter value.
StringBuilder sb = new StringBuilder("");
sb.Append("DELETE FROM Employees ");
sb.Append("WHERE employeeid=@nEmpID");
da.DeleteCommand = new SqlCommand();
da.DeleteCommand.CommandText = sb.ToString();
da.DeleteCommand.Connection = conn;
p1 = new SqlParameter("@nEmpID", SqlDbType.Int);
p1.SourceVersion = DataRowVersion.Original;
p1.SourceColumn = "employeeid";
da.DeleteCommand.Parameters.Add(p1);
The actual structure of the SQL commands is up to you. They don't need to be plain SQL statements, and can be more effective stored procedures if you want to go that direction. If there's a concrete risk that someone else could have updated the row that you read and modified, then you might want to take some more effective counter-measures. If this were the case, you could use a more restrictive WHERE clause on the DELETE and UPDATE commands. The WHERE clause could unequivocally identify the row, but also make sure that all the columns still hold the original value.
UPDATE Employees
SET field1=@new_field1, field2=@new_field2, …, fieldn=@new_fieldn
WHERE field1=@old_field1 AND
field2=@old_field2 AND
:
fieldn=@old_fieldn
Notice that you don't need to fill all command properties, but only those that you plan to use. If the code happens to use a command that has not been specified, an exception is thrown. Setting up the commands for a batch update process may require a lot of code, but you don't need to do it each and every time you batch update. In a fair number of cases, in fact, ADO.NET is capable of automatically generating effective update commands for you.
Command Builders
To utilize default commands, you have to fulfill two requirements. First off, you must assign a valid command object to the SelectCommand property. You don't need to populate other command objects, but SelectCommand must point to a valid query statement. A valid query for the batch update is a query that returns a primary key column. In addition, the query must not include INNER JOIN, calculated columns, and reference multiple tables.
The columns and the table listed in the SelectCommand object will actually be used to prepare the body of the update and insert statements. If you don't set SelectCommand, then ADO.NET command auto-generation cannot work. The following code shows how to code the SelectCommand property.
SqlCommand cmd = new SqlCommand();
cmd.CommandText = "SELECT employeeid, firstname, lastname FROM Employees";
cmd.Connection = conn;
da.SelectCommand = cmd;
Don't worry about the possible impact that SelectCommand may have on performance. The related statement executes only once prior to the batch update process, but it only retrieves column metadata. No matter how you write the SQL statement, no rows will ever be returned to the caller program. This happens because at execution time, the SelectCommand is appended to a SQL batch statement that begins with
SET FMTONLY OFF
SET NO_BROWSETABLE ON
SET FMTONLY ON
As a result, the query does not return rows, but rather column metadata information.
The second requirement your code must fulfill regards command builders. A command builder is a managed provider-specific class that works atop the data adapter object and automatically sets its InsertCommand, DeleteCommand, and UpdateCommand properties. A command builder first runs SelectCommand to collect enough information about the involved tables and columns, and then creates the update commands. The actual commands creation takes place in the command builder class constructor.
SqlCommandBuilder cb = new SqlCommandBuilder(da);
The SqlCommandBuilder class ensures that the specified data adapter can be successfully used to batch update the given data source. The SqlCommandBuilder utilizes some of the properties defined in the SelectCommand object. They are Connection, CommandTimeout, and Transaction. Whenever any of these properties is modified, you need to call the command builder's RefreshSchema method to change the structure of the generated commands of further batch updates.
You can mix together command builders and handcrafted commands. If the InsertCommand property points to a valid command object prior to calling the command builder, then the builder would generate only the code for DeleteCommand and UpdateCommand. A non-null SelectCommand property, instead, is key for command builders to work.
Typically, you use command builders because you don't want to cope with the intricacies of SQL commands. However, if you want to have a look at the source code generated by the builders, you can call methods like GetInsertCommand, GetUpdateCommand, and GetDeleteCommand.
Command builders are a provider-specific feature. So, you should not expect to find them supported by all types of managed providers. They work with SQL Server 7.0 and higher and OLE DB providers.
A nice feature of command builders is that they can detect auto-increment fields and properly tune up the code. In particular, they would take auto-increment fields out of the INSERT statement as long as they have the means to recognize certain fields as auto-increment. This can be done in two ways. For example, you could manually set the AutoIncrement property of the corresponding DataColumn object, or, better yet, have this happen automatically based on the attributes that the column has in a data source like SQL Server. To automatically inherit such an attribute, make sure you change the MissingSchemaAction property of the data adapter from the default value of Add to AddWithKey.
Conflicts Detection
The batch update mechanism is based on an optimistic vision of concurrency. Each record is not locked after being read and remains exposed to other users for reading and writing. In this scenario, a number of potentially inconsistent situations can occur. For example, a row could have been modified, or even deleted, after it was handed to your application from a SELECT statement, but before a batch update process actually changes it back to the server.
If you update data on the server that has been modified in the meantime by some other user, you may raise a data conflict. To avoid new data being overwritten, the ADO.NET command builders generate statements with a WHERE clause that works only if the current state of the data source row is consistent with what the application previously read . If such a command fails updating the row, the ADO.NET runtime throws an exception of type DBConcurrencyException.
The following code snippet demonstrates a more accurate way to execute a batch update operation with ADO.NET.
try
{
da.Update(dsChanges, "Employees");
}
catch (DBConcurrencyException dbdcex)
{
// resolve the conflict
}
The Update method of the data adapter you are using throws the exception for the first row where the update fails. At this time, the control passes back the to client application and the batch update process is stopped. However, all previously submitted changes are committed. This represents another shift from the ADO batch update model.
The DataRow object involved in the conflicted update is made available through the Row property of the DBConcurrencyException class. This DataRow object contains both the proposed and the original value of the row. It does not contain the value currently stored in the database for a given column. This value—the UnderlyingValue property of ADO—can only be retrieved with another query command.
The way in which the conflict is resolved, and the batch update possibly resumed, is strictly application-specific. If there is a situation in which your application needs to resume the update, then be aware of a subtle, yet tricky problem. Once the conflict on the row has been solved in one way or another, you still must figure out a way to accept the changes on the in-memory rows for which batch update completed successfully. If you neglect this technicality, then a new conflict will be raised for the first row that was previously and successfully updated! This will happen over and over again, heading your application straight into a joyful deadlock.
Summary
Compared to ADO, batch update in ADO.NET is more powerful and accessible. In ADO, the batch update mechanism was a sort of black box with rare chances for you to plug into it and change what you needed to do in a slightly different way. Batch update in ADO.NET is more of a low-level solution, and its implementation provides several points where you can get in and take control of the events. The trickiest part of ADO.NET batch update is conflict resolution. I heartily suggest you spend as much time as possible testing and retesting. This investment will payoff with all the time you save with command builders.
Dialog Box: Null Values in Data Tables
I fetch a DataSet out of a database and I am happy. Then I try to save this DataSet into an XML file and I am still happy. But when I read this XML file back into a DataSet, my happiness ends. This is because all the columns with a NULL value are not persisted to XML. Is there a way by which NULL values get added as empty tags to the resultant XML?
The behavior is by design and introduced with the best intentions to save a few bytes during the XML serialization process. If this happens over the network (say, within an XML Web service) the advantage might be significant.
That said, your problem has a very simple workaround. The trick is fetching the column through the ISNULL T-SQL function. Instead of using: SELECT MyColumn FROM MyTable
You should resort to: SELECT ISNULL(MyColumn, '') FROM MyTable
In this case, any NULL value for the column is automatically turned into an empty string and not ignored during the DataSet-to-XML serialization process. The neutral value does not necessarily have to be the empty string. Numeric columns can use 0 or any other logically null value you want to use. |
Dino Esposito is
Wintellect's ADO.NET expert and a trainer and consultant based in Rome, Italy. Dino is a contributing editor to MSDN Magazine and writes the Cutting Edge column. He also regularly contributes to
Developer Network Journal and MSDN News. Dino is the author of the upcoming
Building Web Solutions with ASP.NET and ADO.NET from Microsoft Press, and the cofounder of
http://www.vb2themax.com/. You can reach Dino at
dinoe@wintellect.com.