CreateRemoteThread() for function injection into another process - c#

I have a question about remote threads. I've read Mike Stall's article here: <Link>
I would like to create a remote thread that executes a delegate in another process, just like Mike Stall does. However, he declares the delegate in the target process, obtains a memory address for it, and then creates the remote thread from another process using that address. The code of the target process CANNOT be modified.
So, I cannot use his example, unless I can allocate memory in the target process and then WriteProcessMemory() using my delegate.
I have tried using VirtualAllocEx() to allocate space in the target process but it always returns 0.
This is how it looks so far.
Console.WriteLine("Pid {0}:Started Child process", pid);
uint pidTarget= uint.Parse(args[0]);
IntPtr targetPid= new IntPtr(pidTarget);
// Create delegate I would like to call.
ThreadProc proc = new ThreadProc(MyThreadProc);
Console.WriteLine("Delegate created");
IntPtr fproc = Marshal.GetFunctionPointerForDelegate(proc);
Console.WriteLine("Fproc:"+fproc);
uint allocSize = 512;
Console.WriteLine("AllocSize:" + allocSize.ToString());
IntPtr hProcess = OpenProcess(PROCESS_ALL_ACCESS, false, pidParent);
Console.WriteLine("Process Opened: " + hProcess.ToString());
IntPtr allocatedPtr = VirtualAllocEx(targetPid, IntPtr.Zero, allocSize, AllocationType.Commit, MemoryProtection.ExecuteReadWrite);
Console.WriteLine("AllocatedPtr: " + allocatedPtr.ToString());
Now my questions are:
In the code above, why does VirtualAllocEx() not work? It has been imported from Kernel32 using DllImport. The allocatedPtr is always 0.
How can I calculate alloc size? Is there a way I can see how much space the delegate might need or should I just leave it as a large constant?
How do I call WriteProcessMemory() after all of this to get my delegate into the other process?
Thank you in advance.

That blog post is of very questionable value. It is impossible to make this work in the general case. It only works because:
the CLR is known to be available
the address of the method to execute is known
it doesn't require injecting a DLL in the target process
Windows security is unlikely to stop this particular approach
It achieves this by handing the client process everything it needs to get that thread started. The far more typical usage of CreateRemoteThread is when the target process does not cooperate. In other words, you don't have the CLR, you have to inject a DLL with the code, that code can't be managed, you have to deal with the DLL getting relocated, and Windows will balk at all this.
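Anyhoo, addressing your question: you don't check for any errors, so you don't know what is going wrong. Make sure your [DllImport] declarations have SetLastError=true, check the return value for failure (IntPtr.Zero here) and use Marshal.GetLastWin32Error() to retrieve the error code. Also note that the first argument of VirtualAllocEx() is a process handle, not a process ID: pass the hProcess returned by OpenProcess() instead of targetPid, and make sure OpenProcess() is given the target's PID (the snippet passes pidParent, not the pidTarget parsed from args[0]).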

Related

Can a managed process get terminated while writing to shared memory?

I have several (managed / .NET) processes communicating over a ring buffer which is held in shared memory via the MemoryMappedFile class (just memory, no file mapped). I know from the SafeBuffer reference source that writing a struct to that memory is guarded by a CER (Constrained Execution Region), but what if the writing process gets abnormally terminated by the OS while doing so? Can this lead to the struct being written only partially?
struct MyStruct
{
public int A;
public int B;
public float C;
}
static void Main(string[] args)
{
var mappedFile = MemoryMappedFile.CreateOrOpen("MyName", 10224);
var accessor = mappedFile.CreateViewAccessor(0, 1024);
MyStruct myStruct;
myStruct.A = 10;
myStruct.B = 20;
myStruct.C = 42f;
// Assuming the process gets terminated during the following write operation.
// Is that even possible? If it is possible what are the guarantees
// in regards to data consistency? Transactional? Partially written?
accessor.Write(0, ref myStruct);
DoOtherStuff(); ...
}
It is hard to simulate / test whether this problem really exists since writing to memory is extremely fast. However, it would certainly lead to a severe inconsistency in my shared memory layout and would make it necessary to approach this with, for example, checksums or some sort of page flipping.
Update:
Looking at Line 1053 in
https://referencesource.microsoft.com/#mscorlib/system/io/unmanagedmemoryaccessor.cs,7632fe79d4a8ae4c
it basically comes down to whether a process is protected from abnormal termination while executing code in a CER block (with the Consistency.WillNotCorruptState flag set).
Yes, a process can be stopped at any moment.
The SafeBuffer.Write<T> method eventually calls into
[MethodImpl(MethodImplOptions.InternalCall)]
[ResourceExposure(ResourceScope.None)]
[ReliabilityContract(Consistency.WillNotCorruptState, Cer.Success)]
private static extern void StructureToPtrNative(/*ref T*/ TypedReference structure, byte* ptr, uint sizeofT);
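which basically does a memcpy(ptr, structure, sizeofT). A memcpy of a multi-field struct is not atomic as a whole: at best each individual aligned store is, and unaligned writes are never atomic except for single bytes. So if your process is terminated in the middle of the copy, the struct can be left partially written.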
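The question mentions checksums as one way to cope with this. Below is a minimal C++ sketch of that idea, assuming a single writer and a plain byte buffer standing in for the shared view; the Record layout, the extra checksum field and the choice of FNV-1a are illustrative assumptions, not part of the .NET API. The writer stores a checksum of the payload next to it, and a reader rejects any record whose checksum does not match, which is what you would typically see after a copy that was interrupted midway. It only detects torn records; it does not replace synchronization between a live reader and writer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Mirrors the question's MyStruct, plus a checksum over the payload fields.
struct Record {
    int32_t a;
    int32_t b;
    float c;
    uint32_t checksum;  // FNV-1a over a, b, c
};

// FNV-1a over the payload bytes (everything before the checksum field).
uint32_t payload_checksum(const Record& r) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(&r);
    uint32_t h = 2166136261u;
    for (std::size_t i = 0; i < offsetof(Record, checksum); ++i) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

// Writer: fill in the checksum, then copy the whole record into shared memory.
void write_record(unsigned char* shared, Record r) {
    r.checksum = payload_checksum(r);
    std::memcpy(shared, &r, sizeof r);
}

// Reader: copy the record out and report whether it is internally consistent.
bool read_record(const unsigned char* shared, Record& out) {
    std::memcpy(&out, shared, sizeof out);
    return out.checksum == payload_checksum(out);
}
```

In the usage below, a second write is cut off after its first 4 bytes, as if the writer had been terminated mid-copy, and the reader detects the mismatch.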
When a process is terminated the hard way via TerminateProcess or an unhandled exception, no CER or related cleanup code is ever executed. There is no graceful managed shutdown in that case, and your application can be stopped right in the middle of an important transaction. Your shared memory data structures will be left in an orphaned state, and if you protect them with a mutex, the next waiter's WaitForSingleObject call returns WAIT_ABANDONED. That is how Windows tells you that a process died while holding the lock and that you need to recover from the changes made by the last writer.
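Recovering after an abandoned writer is easier if the last complete record is never overwritten in place. A minimal sketch of the page-flipping idea from the question, assuming one writer and a hypothetical layout of two slots plus a committed index: the writer fills the slot readers are not using and only then flips the index with a single atomic store, so a writer killed before the flip leaves the previously committed slot intact. For cross-process use the index must be lock-free (std::atomic<uint32_t> normally is). This sketch does not protect a reader that is still copying a slot while the writer reuses it two updates later; that still needs a lock or a sequence counter.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

struct Payload {
    int32_t a;
    int32_t b;
    float c;
};

// Hypothetical shared layout: two slots and the index of the last fully written one.
struct FlipArea {
    Payload slots[2];
    std::atomic<uint32_t> committed{0};  // 0 or 1
};

// Step 1: write into the inactive slot; readers never look at it.
void write_inactive(FlipArea& area, const Payload& p) {
    uint32_t cur = area.committed.load(std::memory_order_relaxed);
    area.slots[1 - cur] = p;
}

// Step 2: publish the new slot with one atomic store.
void commit(FlipArea& area) {
    uint32_t cur = area.committed.load(std::memory_order_relaxed);
    area.committed.store(1 - cur, std::memory_order_release);
}

// Readers always copy the committed slot.
Payload read_current(const FlipArea& area) {
    return area.slots[area.committed.load(std::memory_order_acquire)];
}
```

If the writer dies between the two steps, readers simply keep seeing the previous committed value, so there is nothing partial to repair.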

ICorProfilerCallback2: CLR profiler does not log all Leave calls

I am trying to write a profiler that logs all .NET method calls in a process. The goal is to make it highly performant and keep, say, the last 5-10 minutes in memory (fixed buffer, cyclically overwriting old info) until the user triggers writing that info to disk. The intended use is to track down rarely reproducing performance issues.
I started off with the SimpleCLRProfiler project from https://github.com/appneta/SimpleCLRProfiler. The profiler uses the ICorProfilerCallback2 callback interface of the .NET profiling API. I got it to compile and work in my environment (Win 8.1, .NET 4.5, VS2012). However, I noticed that sometimes Leave calls are missing for methods whose Enter calls were logged. Example of a Console.WriteLine call (I reduced the DbgView output to the minimum needed to understand it):
Line 1481: Entering System.Console.WriteLine
Line 1483: Entering SyncTextWriter.WriteLine
Line 1485: Entering System.IO.TextWriter.WriteLine
Line 1537: Leaving SyncTextWriter.WriteLine
Two Entering calls don't have corresponding Leaving calls. The profiled .Net code looks like this:
Console.WriteLine("Hello, Simple Profiler!");
The relevant SimpleCLRProfiler methods are:
HRESULT CSimpleProfiler::registerGlobalCallbacks()
{
HRESULT hr = profilerInfo3->SetEnterLeaveFunctionHooks3WithInfo(
(FunctionEnter3WithInfo*)MethodEntered3,
(FunctionLeave3WithInfo*)MethodLeft3,
(FunctionTailcall3WithInfo*)MethodTailcall3);
if (FAILED(hr))
Trace_f(L"Failed to register global callbacks (%s)", _com_error(hr).ErrorMessage());
return hr; // propagate the failure instead of always reporting S_OK
}
void CSimpleProfiler::OnEnterWithInfo(FunctionID functionId, COR_PRF_ELT_INFO eltInfo)
{
MethodInfo info;
HRESULT hr = info.Create(profilerInfo3, functionId);
if (FAILED(hr))
Trace_f(L"Enter() failed to create MethodInfo object (%s)", _com_error(hr).ErrorMessage());
Trace_f(L"[%p] [%d] Entering %s.%s", functionId, GetCurrentThreadId(), info.className.c_str(), info.methodName.c_str());
}
void CSimpleProfiler::OnLeaveWithInfo(FunctionID functionId, COR_PRF_ELT_INFO eltInfo)
{
MethodInfo info;
HRESULT hr = info.Create(profilerInfo3, functionId);
if (FAILED(hr))
Trace_f(L"Leave() failed to create MethodInfo object (%s)", _com_error(hr).ErrorMessage());
Trace_f(L"[%p] [%d] Leaving %s.%s", functionId, GetCurrentThreadId(), info.className.c_str(), info.methodName.c_str());
}
Does anybody have an idea why the .NET profiler would not make Leave calls for all methods that return? By the way, I checked that OnLeaveWithInfo does not exit early, before tracing, due to an exception or similar. It doesn't.
Thanks, Christoph
Since stakx does not seem to be coming back to my question to post an official answer (and get the credit), I will do it for him:
As stakx had hinted at, I didn't log tail calls. In fact, I wasn't even aware of the concept so I had completely ignored that hook method (it was wired up but empty). I found a good explanation of tail calls here: David Broman's CLR Profiling API Blog: Enter, Leave, Tailcall Hooks Part 2: Tall tales of tail calls.
I quote from the link above:
Tail calling is a compiler optimization that saves execution of instructions and saves reads and writes of stack memory. When the last thing a function does is call another function (and other conditions are favorable), the compiler may consider implementing that call as a tail call, instead of a regular call.
Consider this code:
static public void Main() {
Helper();
}
static public void Helper() {
One();
Three();
}
static public void Three() {
...
}
When method Three is called, without tail call optimization, the stack will look like this.
Three
Helper
Main
With tail call optimization, the stack looks like this:
Three
Main
So before Three is called, the optimization has already popped Helper off the stack. As a result, there is one less method on the stack (less memory usage), and some instructions and memory writes are saved.
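To see why a profiler then logs an Enter without a matching Leave, it helps to model the per-thread shadow stack it might build from the Enter/Leave/Tailcall hooks. A minimal sketch (the ShadowStack class, and the use of method names instead of FunctionIDs, are assumptions for illustration): the Tailcall hook fires for the function that is about to make the tail call (Helper here), and because that frame is being replaced, it must be popped at that point, since no Leave will ever arrive for it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Per-thread shadow stack driven by hypothetical Enter/Leave/Tailcall events.
class ShadowStack {
public:
    void enter(const std::string& fn) { frames_.push_back(fn); }

    // Leave: the returning function must be the one on top.
    bool leave(const std::string& fn) {
        if (frames_.empty() || frames_.back() != fn) return false;
        frames_.pop_back();
        return true;
    }

    // Tailcall: the caller's frame is replaced by the callee, so pop it now;
    // the runtime will not report a Leave for it later.
    bool tailcall(const std::string& fn) { return leave(fn); }

    std::size_t depth() const { return frames_.size(); }
    const std::vector<std::string>& frames() const { return frames_; }

private:
    std::vector<std::string> frames_;
};
```

Replaying the Main/Helper/Three example: after Helper's tail call, the stack holds only Main and Three, matching the second stack picture above, and every Enter is eventually balanced by either a Leave or a Tailcall.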

.NET Interop call is limited to single thread?

I have the following code that uses new .NET 4.5 multi-threading functionality.
Action2 is a call to the Windows API library MLang through interop.
BlockingCollection<int> _blockingCollection= new BlockingCollection<int>();
[Test]
public void Do2TasksWithThreading()
{
Stopwatch stopwatch = new Stopwatch();
stopwatch.Start();
var tasks = new List<Task>();
for (int i = 0 ; i < Environment.ProcessorCount; i++)
{
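int taskIndex = i; // copy: a lambda in a for loop would otherwise capture the shared loop variable i
tasks.Add(Task.Factory.StartNew(() => DoAction2UsingBlockingCollection(taskIndex)));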
}
for (int i = 1; i < 11; i++)
{
DoAction1(i);
_blockingCollection.Add(i);
}
_blockingCollection.CompleteAdding();
Task.WaitAll(tasks.ToArray());
stopwatch.Stop();
Console.WriteLine("Total time: " + stopwatch.ElapsedMilliseconds + "ms");
}
private void DoAction2UsingBlockingCollection(int taskIndex)
{
WriteToConsole("Started wait for Action2 Task: " + taskIndex);
int index;
while (_blockingCollection.Count > 0 || !_blockingCollection.IsAddingCompleted)
{
if (_blockingCollection.TryTake(out index, 10))
DoAction2(index);
}
WriteToConsole("Ended wait for Action2 Task: " + taskIndex);
}
private void DoAction2(int index)
{
// ... Load File bytes
// Call to MLang through interop
Encoding[] detected = EncodingTools.DetectInputCodepages(bytes, 1);
// ... Save results in concurrent dictionary
}
I did some testing with this code, and increasing the number of threads from 1 to 2 to 3, etc. doesn't make the process run any faster. It looks like the threads are waiting for the interop call to finish, which makes me think that it is using a single thread for some reason.
Here is the definition of Interop method:
namespace MultiLanguage
{
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Security;
[ComImport, InterfaceType((short) 1), Guid("DCCFC164-2B38-11D2-B7EC-00C04F8F5D9A")]
public interface IMultiLanguage2
{
// (other members omitted)
[MethodImpl(MethodImplOptions.InternalCall, MethodCodeType=MethodCodeType.Runtime)]
void DetectInputCodepage([In] MLDETECTCP flags, [In] uint dwPrefWinCodePage,
[In] ref byte pSrcStr, [In, Out] ref int pcSrcSize,
[In, Out] ref DetectEncodingInfo lpEncoding,
[In, Out] ref int pnScores);
}
}
Is there anything that can be done to make this use multiple threads? The only thing I noticed that would require a single thread is MethodImplOptions.Synchronized, but that's not being used in this case.
The code for EncodingTools.cs was taken from here:
http://www.codeproject.com/Articles/17201/Detect-Encoding-for-In-and-Outgoing-Text
... Load File bytes
Threads can speed up your program when your machine has multiple processor cores, which are easy to get these days. Your program is however liable to spend a good bit of time in this invisible code; disk I/O is very slow compared to the raw processing speed of a modern processor. And you still have only a single disk, so there is no concurrency at all: threads will just wait their turn to read data from the disk.
[ComImport, InterfaceType((short) 1), Guid("DCCFC164-2B38-11D2-B7EC-00C04F8F5D9A")]
public interface IMultiLanguage2
This is a COM interface, implemented by the CMultiLanguage coclass. You can find it in the registry with Regedit.exe: the HKEY_LOCAL_MACHINE\SOFTWARE\Classes\CLSID\{275C23E2-3747-11D0-9FEA-00AA003F8646} key contains the configuration for this coclass. Threading is not a detail left up to the client programmer in COM; a COM coclass declares what kind of threading it supports with the ThreadingModel key.
The value for CMultiLanguage is "Both". Which is good news, but it now greatly matters exactly how you created the object. If the object is created on an STA thread, the default for the main thread in a Winforms or WPF project, then COM ensures all the code stays thread-safe by marshaling interface method calls from your worker thread to the STA thread. That will cause loss of concurrency, the threads take their turn entering the single-threaded apartment.
You can only get concurrency when the object was created on an MTA thread: the kind you get from a threadpool thread, or your own Thread without a call to its SetApartmentState() method. An obvious approach to ensure this is to create the CMultiLanguage object on the worker thread itself and avoid having the worker threads share the same object.
Before you start fixing that, you first need to identify the bottleneck in the program. Focus on the file loading first and make sure you get a realistic measurement, avoid running your test program on the same set of files over and over again. That gives unrealistically good results since the file data will be read from the file system cache. Only the first test after a reboot or file system cache reset gives you a reliable measurement. The SysInternals' RamMap utility is very useful for this, use its Empty + Empty Standby List menu command before you start a test to be able to compare apples to apples.
If that shows that file loading is the bottleneck, then you are done; only improved hardware can solve that. If however you measure that the IMultiLanguage2 calls are the bottleneck, then focus on the usage of the CMultiLanguage object. There is no guarantee that you can get ahead even then: a COM server typically provides thread-safety by taking care of the locking for you, and such hidden locking can ruin your odds of getting concurrency. The only way to get ahead then is to get the file reading in one thread to overlap with the parsing in another.
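The overlap suggested in the last sentence can be sketched as a small producer/consumer pipeline. This is a portable C++ sketch, not the asker's C# code: load_file and parse are hypothetical stand-ins for the disk read and the IMultiLanguage2 call, and the bounded queue (capacity 4, an arbitrary choice) keeps the reader thread from running far ahead of the parser. In C# the same shape can be built from a BlockingCollection with a bounded capacity and a single reading task.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Bounded queue handing file contents from the reader thread to the parser.
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    // Blocks while the queue is full.
    void push(std::string item) {
        std::unique_lock<std::mutex> lock(m_);
        not_full_.wait(lock, [this] { return q_.size() < capacity_; });
        q_.push(std::move(item));
        not_empty_.notify_one();
    }

    // Signals that no more items will be pushed.
    void close() {
        std::lock_guard<std::mutex> lock(m_);
        closed_ = true;
        not_empty_.notify_all();
    }

    // Blocks until an item is available; returns std::nullopt once closed and drained.
    std::optional<std::string> pop() {
        std::unique_lock<std::mutex> lock(m_);
        not_empty_.wait(lock, [this] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;
        std::string item = std::move(q_.front());
        q_.pop();
        not_full_.notify_one();
        return item;
    }

private:
    std::mutex m_;
    std::condition_variable not_full_;
    std::condition_variable not_empty_;
    std::queue<std::string> q_;
    std::size_t capacity_;
    bool closed_ = false;
};

// One thread reads files while the calling thread parses what has been read.
template <class Load, class Parse>
void pipeline(const std::vector<std::string>& paths, Load load_file, Parse parse) {
    BoundedQueue queue(4);
    std::thread reader([&] {
        for (const auto& path : paths) queue.push(load_file(path));
        queue.close();
    });
    while (auto data = queue.pop()) parse(*data);
    reader.join();
}
```

Only one thread touches the disk, which matches the single-disk point above, while parsing proceeds as soon as each file's bytes are available.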
Try running nunit-console with parameter /apartment=MTA

"0x80010100: System call failed" exception, ContextSwitchDeadlock

Long story short: in a C# application that works with COM inproc-server (dll), I encounter "0x80010100: System call failed" exception, and in debug mode also ContextSwitchDeadlock exception.
Now more in details:
1) The C# app initializes an STA, creates a COM object (registered as "Apartment"), then subscribes to its connection point and begins working with the object.
2) At some stage the COM object generates a lot of events, passing as an argument a very big collection of COM objects, which are created in the same apartment.
3) The event-handler on C# side processes the above collection, occasionally calling some methods of the objects. At some stage the latter calls begin to fail with the above exceptions.
On the COM side the apartment uses a hidden window whose winproc looks like this:
typedef std::function<void(void)> Functor;
LRESULT CALLBACK WndProc(HWND hwnd, UINT msg, WPARAM wParam, LPARAM lParam)
{
switch(msg)
{
case AM_FUNCTOR:
{
Functor *f = reinterpret_cast<Functor *>(lParam);
(*f)();
delete f;
}
break;
case WM_CLOSE:
DestroyWindow(hwnd);
break;
default:
return DefWindowProc(hwnd, msg, wParam, lParam);
}
return 0;
}
The events are posted to this window from other parts of the COM server:
void post(const Functor &func)
{
Functor *f = new Functor(func);
if (!PostMessage(hWind_, AM_FUNCTOR, 0, reinterpret_cast<LPARAM>(f)))
delete f; // posting failed (e.g. the queue is full), so WndProc will never free it
}
The events are standard ATL CP implementations bound with the actual params, and they boil down to something like this:
pConnection->Invoke(id, IID_NULL, LOCALE_USER_DEFAULT, DISPATCH_METHOD, &params, &varResult, NULL, NULL);
In C# the handler looks like this:
private void onEvent(IMyCollection objs)
{
int len = objs.Count; // usually 10000 - 25000
foreach (IMyObj obj in objs)
{
// some of the following calls fail with 0x80010100
int id = obj.id;
string name = obj.name;
// etc...
}
}
==================
So, can the above problem happen just because the apartment's message queue is overloaded with the events it tries to deliver? Or would the message loop have to be completely blocked to cause such behaviour?
Let's assume that the message queue has 2 sequential events that evaluate to an "onEvent" call. The first one enters C# managed code, which attempts to re-enter the unmanaged code in the same apartment. Usually this is allowed, and we do it a lot. When, and under what circumstances, can it fail?
Thanks.
This ought to work even with multiple apartments provided that:
Only one of the threads responds to external events such as network traffic, timers, posted messages etc.
Other threads only service COM requests (even if they call back to the main thread during the processing).
AND
neither thread queue ever gets full, preventing COM from communicating with the thread.
Firstly:
It looks like some objects are not in the same apartment as other objects. Are you sure that all objects are being created in the STA?
What you are describing is a classic deadlock - two independent threads, each waiting on the other. That is what I would expect to occur with that design operating with the C# and COM sides on different threads.
You should be OK if all the objects are on the same thread, as well as the hidden window being on that thread, so I think you need to check that. (Obviously this includes any other objects which are created by the COM side and passed over to the C# side.)
You could try debugging this by pressing "pause" in the debugger and checking what code was in each thread (if you see RPCRT*.DLL this means you are looking at a proxy). Alternately you could DebugPrint the current thread ID from various critical points in both C# and COM sides and your WndProc - they should all be the same.
Secondly: it ought to work with multiple threads provided that only one of the threads generates work items, and the other does nothing but host COM objects which respond to calls (i.e. it doesn't generate calls from timers, network traffic, posted messages etc.). In your case it may be that the thread queue is full and COM cannot reply to a call.
Instead of using the thread queue, you should use a deque protected by a critical section.
http://msdn.microsoft.com/en-us/library/windows/desktop/ms644944(v=vs.85).aspx
There is a limit of 10,000 posted messages per message queue. This limit should be sufficiently large. If your application exceeds the limit, it should be redesigned to avoid consuming so many system resources.
You might maintain a counter of items on/off the queue to see if this is the issue.
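Combining the last two suggestions, here is a minimal portable sketch of a deque-based work queue with on/off counters. Assumptions: std::mutex stands in for the Win32 CRITICAL_SECTION, and the WorkQueue class and its method names are hypothetical. post() returns true only when the queue goes from empty to non-empty, so the COM server would call PostMessage for a single wake-up message at that point instead of one message per event, and the window procedure would call drain() when that message arrives. This keeps the thread's message queue from filling up; it does not by itself change the reentrancy behaviour discussed above.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>

class WorkQueue {
public:
    using Functor = std::function<void()>;

    // Returns true if the caller should post one wake-up message.
    bool post(Functor f) {
        std::lock_guard<std::mutex> lock(m_);
        const bool was_empty = items_.empty();
        items_.push_back(std::move(f));
        ++added_;
        return was_empty;
    }

    // Runs on the owning thread; executes items until the deque is empty.
    std::size_t drain() {
        std::size_t ran = 0;
        for (;;) {
            Functor f;
            {
                std::lock_guard<std::mutex> lock(m_);
                if (items_.empty()) return ran;
                f = std::move(items_.front());
                items_.pop_front();
                ++removed_;
            }
            f();  // run outside the lock so f may post() more work
            ++ran;
        }
    }

    // Counters of items put on / taken off the queue, for diagnosing a backlog.
    std::size_t added() const { std::lock_guard<std::mutex> lock(m_); return added_; }
    std::size_t removed() const { std::lock_guard<std::mutex> lock(m_); return removed_; }

private:
    mutable std::mutex m_;
    std::deque<Functor> items_;
    std::size_t added_ = 0;
    std::size_t removed_ = 0;
};
```

With this shape, 10,000 queued events cost one posted message rather than 10,000, and the difference between added() and removed() shows how far behind the consuming thread is.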

Problem using OpenProcess and ReadProcessMemory

I'm having some problems implementing an algorithm to read a foreign process' memory. Here is the main code:
System.Diagnostics.Process.EnterDebugMode();
IntPtr retValue = WinApi.OpenProcess((int)WinApi.OpenProcess_Access.VMRead | (int)WinApi.OpenProcess_Access.QueryInformation, 0, (uint)_proc.Id);
_procHandle = retValue;
WinApi.MEMORY_BASIC_INFORMATION[] mbia = getMemoryBasicInformation().Where(p => p.State == 0x1000).ToArray();
foreach (WinApi.MEMORY_BASIC_INFORMATION mbi in mbia) {
byte[] buffer = Read((IntPtr)mbi.BaseAddress, mbi.RegionSize);
foreach (IntPtr addr in ByteSearcher.FindInBuffer(buffer, toFind, (IntPtr)0, mbi.RegionSize, increment)) {
yield return addr;
}
}
In the Read() method:
if (!WinApi.ReadProcessMemory(_procHandle, address, buffer, size, out numberBytesRead)) {
throw new MemoryReaderException(
string.Format(
"There was an error with ReadProcessMemory()\nGetLastError() = {0}",
Marshal.GetLastWin32Error() // requires SetLastError = true on the ReadProcessMemory [DllImport]
));
}
Although it generally seems to work correctly, the problem is that for some memory regions ReadProcessMemory returns false and GetLastError returns 299. From what I've googled, this seems to happen on Vista because some OpenProcess parameters were changed. Does anyone know what this is about, and what values I should try? Note that since they changed, I don't just want to know whether it's VM_READ or similar; I want to know the exact values.
EDIT: maybe it has something to do with not calling VirtualProtect()/VirtualProtectEx(), as seen in this SO question: WriteProcessMemory/ReadProcessMemory fail
Edit 2: That was it! The solution is to call VirtualProtectEx() first and then ReadProcessMemory().
C:\Debuggers>kd -z C:\Windows\notepad.exe
0:000> !error 0n299
Error code: (Win32) 0x12b (299) - Only part of a ReadProcessMemory
or WriteProcessMemory request was completed.
This means you tried to read a block that partly consisted of unmapped addresses (i.e. if the app itself had done this read, it would have hit an access violation).
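One defensive pattern, not something these answers propose, is to read a region in page-sized pieces and skip the pieces that fail, so a single unreadable page does not discard the whole region. The sketch below simulates this in portable C++. ReadFn is a hypothetical stand-in for ReadProcessMemory that returns the number of bytes read (0 on failure); it is not the Win32 signature, and the 0x1000 page size used in the usage example is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical reader: copies up to len bytes at addr into buf, returns bytes read (0 = failure).
using ReadFn = std::function<std::size_t(std::uintptr_t addr, unsigned char* buf, std::size_t len)>;

struct Chunk {
    std::uintptr_t addr;
    std::vector<unsigned char> bytes;
};

// Reads [base, base + size) one page at a time, keeping only the pages that could be read.
std::vector<Chunk> read_region(std::uintptr_t base, std::size_t size, std::size_t page,
                               const ReadFn& read) {
    std::vector<Chunk> out;
    for (std::size_t off = 0; off < size; off += page) {
        std::size_t len = std::min(page, size - off);
        std::vector<unsigned char> buf(len);
        std::size_t got = read(base + off, buf.data(), len);
        if (got == 0) continue;  // unreadable page: skip it instead of failing the region
        buf.resize(got);
        out.push_back({base + off, std::move(buf)});
    }
    return out;
}
```

Each returned Chunk records its own start address, so a byte searcher can still translate buffer offsets back into target-process addresses even when pages are missing.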
You store the handle to the newly opened process in a local variable (retValue), but you don't pass it to your getMemoryBasicInformation function, so I can only assume that it actually fetches information about the current process. I suspect you're really using your own process's address ranges as though they belong to the other process. Many of the address ranges will probably be the same between processes, so that error wouldn't be immediately apparent.
