I'm using LINQ to compare two DataSets with each other to create new rows and update existing. I've noticed that the complete comparison lasts ~1,5 hours and only one of the two cores is busy(Task-Manager is 50-52% CPU Usage). I must admit that I'm completely new to parallel LINQ, but I assume that it could increase performance significantly.
So my question is, how and what should I parallelize?
These are the original queries(reduced to the essentials):
'check for new data Dim srcUnique = From row In src.Email_Total Select Ticket_ID = row.ticket_id, Interaction = row.interaction, ModifiedAt = row.modified_time Dim destUnique = From row In dest.ContactDetail Where row.ContactRow.fiContactType = emailContactType.idContactType Select row.ContactRow.Ticket_ID, row.Interaction, row.ModifiedAt 'get all emails(contactdetails) that are in source but not in destination Dim diffRows = srcUnique.Except(destUnique).ToList 'get all new emails(according to ticket_id) for calculating contact columns Dim newRowsTickets = (From row In src.Email_Total Join d In diffRows On row.ticket_id Equals d.Ticket_ID _ And row.interaction Equals d.Interaction _ And row.modified_time Equals d.ModifiedAt Group row By Ticket_ID = row.ticket_id Into NewTicketRows = Group).ToList For Each ticket In newRowsTickets Dim contact = dest.Contact.FindByTicket_IDfiContactType(ticket.Ticket_ID, emailContactType.idContactType) If contact Is Nothing Then ' Create new Contact with many sub-queries on this ticket(omitted) ****' Dim newContact = Me.dest.Contact.NewContactRow dest.Contact.AddContactRow(newContact) contact = newContact Else ' Update Contact with many sub-queries on this ticket(omitted) ' End If daContact.Update(dest.Contact) ' Add new ContactDetail-Rows from this Ticket(this is the counterpart of the src.Email_Total-Rows, details omitted) ' For Each newRow In ticket.NewTicketRows Dim newContactDetail = dest.ContactDetail.NewContactDetailRow newContactDetail.ContactRow = contact dest.ContactDetail.AddContactDetailRow(newContactDetail) Next daContactDetails.Update(dest.ContactDetail) Next
DataTables, where every ContactDetail belongs to a Contact.
Even if not both cores would use 100% CPU, I assume that it would increase performance significantly if I would parallelize the queries, because the second core is nearly idle. The
for each might also be a good place to optimize since the tickets are not related to each other. So I assume that I could loop with multiple threads and create/update records parallel. But how to do it with PLINQ?
Side Note: As I've mentioned in the comments, performance is not a key factor for me so far, since the server's only purpose is to synchronize the MySQL Database(on another server) with a MS SQL-Server(on the same server as this Windows-Service). It acts as a source for reports that are generated by another service. But these reports are only generated once a day. But apart from that I was interested in learning PLINQ because I thought that this could be an excellent exercise. It takes the mentioned 1,5h only if destination DB is empty and all records must be created. If both databases are nearly in sync, this method takes only ~1 minute yet. In future performance will become more important since email is only one of several contact-types(chat+calls will exceed 1mil.records). I think that I'll anyway need some kind of (LINQ) Data-Paging then.
If something is unclear I'll update my answer accordingly. Thanks in advance.
Edit: Here is the result of my investigations and attempts:
Question: How to "PLINQ" an existing LINQ query with joins?
Answer: Note that some LINQ operators are binary—they take two IEnumerables as input. Join is a perfect example of such an operator. In these cases, the type of the left-most data source determines whether LINQ or PLINQ is used. Thus you need only call AsParallel on the first data source for your query to run in parallel:
IEnumerable<T> leftData = ..., rightData = ...; var q = from x in leftData.AsParallel() join y in rightData on x.a == y.b select f(x, y);
But if I change my query in the following way(note the
Dim newRowsTickets = (From row In src.Email_Total.AsParallel() Join d In diffRows On row.ticket_id Equals d.Ticket_ID _ And row.interaction Equals d.Interaction _ And row.modified_time Equals d.ModifiedAt Group row By Ticket_ID = row.ticket_id Into NewTicketRows = Group).ToList
The compiler will complain that I need to add
AsParallel to the right datasource as well. So this seem to be a VB.NET issue or a lack of documentation(article is from 2007). I assume the latter because the(apart from that recommendable) article also says that you need to add
System.Concurrency.dll manually but actually it is part of .NET 4.0 Framework and in Namespace
I realized that I won't profit from a parallelized
Except since the query is fast enough in sequential mode(even with nearly the same number of rows in both collection which results in the maximum number of comparisons, I got the result in less than 30 seconds). But I will add it for the sake of completeness later.
So I decided to parallelize the
for-each what is as easy as with LINQ-Queries, you simply need to add
AsParallel() at the end.
But I realized that I need to force the parallelism with
WithExecutionMode(ParallelExecutionMode.ForceParallelism), otherwise .NET decides to use only one core for this loop. I also wanted to tell .NET that I wish to use as many Threads as possible but not more than 8:
Now both cores are working at the same time, but the CPU usage stays on 54%.
So this is the PLINQ version so far:
Dim diffRows = srcUnique.AsParallel.Except(destUnique.AsParallel).ToList Dim newRowsTickets = (From row In src.Email_Total.AsParallel() Join d In diffRows.AsParallel() On row.ticket_id Equals d.Ticket_ID _ And row.interaction Equals d.Interaction _ And row.modified_time Equals d.ModifiedAt Group row By Ticket_ID = row.ticket_id Into NewTicketRows = Group).ToList For Each ticket In newRowsTickets. AsParallel(). WithDegreeOfParallelism(8). WithExecutionMode(ParallelExecutionMode.ForceParallelism) ' blah,blah ... ' 'add new ContactDetails for this Ticket(only new rows) For Each newRow In ticket.NewTicketRows. AsParallel(). WithExecutionMode(ParallelExecutionMode.Default) ' blah,blah ... ' Next daContactDetails.Update(dest.ContactDetail) Next
Unfortunately I don't see any performance benefits from using
AsParallel in comparison with sequential mode:
for each with
09/29/2011 18:54:36: Contacts/ContactDetails created or modified. Duration: 01:21:34.40
09/29/2011 16:02:55: Contacts/ContactDetails created or modified. Duration: 01:21:24.50
Can somebody explain me this result? Is the database' write access in the
for each responsible for the similar time?
Following are recommendable readings:
- http://msdn.microsoft.com/en-us/magazine/cc163329.aspx (not up-to-date)
- Overview of videos/articles according this topic
- PDF: When To Use ParallelForEach Or PLINQ
- When PLINQ will speedup performance